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This  book  features  contributions  stemming  from  the  activities  of  the  ENERGIC 
(European  Network  Exploring  Research  into  Geospatial  Information  Crowd- 
sourcing: software  and  methodologies  for  harnessing  geographic  information 
from  the  crowd)  scientific  network.  Researchers  from  23  European  countries 
participate  in  ENERGIC.  It  is  funded  as  action  IC1203  by  the  COST  (Coopera- 
tion in  Science  and  Technology)  programme,  which  is  a European  framework 
supporting  trans-national  cooperation  among  scientists,  engineers,  and  schol- 
ars across  Europe. 

The  ENERGIC  network  was  born  out  of  scientific  connections  in  the  area  of  Geo- 
graphic Information  Science  and  friendships  that  can  be  traced  back  over  20  years 
ago.  Indeed,  the  first  important  event  was  the  specialist  meeting  on  Volunteered 
Geographic  Information  (VGI)  organised  in  2007  in  Santa  Barbara  (California) 
under  the  auspices  of  NCGIA,  Los  Alamos  National  Laboratory,  the  Army 
Research  Office  and  The  Vespucci  Initiative  (www.vespucci.org).  A number  of 
fundamental  questions  were  examined  at  this  meeting  and  the  results  showed 
that  VGI  was  a field  of  great  potential,  but  lacking  methodological  and  functional 
developments  (NCGIA  2007). 


How  to  cite  this  book  chapter: 

Capineri,  C,  Haklay,  M,  Huang,  H,  Antoniou,  V,  Kettunen,  J,  Ostermann,  F and  Purves,  R. 
2016.  Introduction.  In:  Capineri,  C,  Haklay,  M,  Huang,  H,  Antoniou,  V, 

Kettunen,  J,  Ostermann,  F and  Purves,  R.  (eds.)  European  Handbook  of 
Crowdsourced  Geographic  Information,  Pp.  1-11.  London:  Ubiquity  Press. 
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Five  years  later,  in  2012,  the  ENERGIC  action  started  exploring  new  VGI 
sources,  sharing  and  developing  data  retrieval  software,  assessing  VGI  quality, 
defining  standardization  criteria  for  interoperability  with  other  datasets,  iden- 
tifying applications  and  transferring  them  for  business  implementation  (mar- 
ket analysis,  risk  management,  advertising,  etc.1). 

The  action  is  based  on  the  study  of  the  remarkable  new  source  of  geographic 
information  that  has  become  available  in  the  form  of  user-generated  content 
accessible  over  the  Internet.  People  now  consume  and  produce  geographic 
information  on  the  go  via  platforms  such  as  Facebook,  Twitter,  Flickr,  Ins- 
tagram  and  others.  The  availability  of  cheap  GPS  allows  everyone  to  survey 
and  map  and  contribute  to  projects  like  Wikimapia  and  OpenStreetMap.  The 
exploitation,  integration  and  application  of  these  sources,  termed  crowd- 
sourced or  user  generated  information,  offer  to  multidisciplinary  scientists 
an  unprecedented  opportunity  to  conduct  research  on  a variety  of  topics  at 
multiple  scales. 

The  most  popular  definition  of  such  content  that  possesses  a geographic 
reference  data  is  Volunteered  Geographic  Information  (VGI),  first  coined  by 
Michael  Goodchild  in  2007:  its  success  is  revealed  by  the  growing  number  of 
articles  published  since  2007.  By  simply  searching  Google  scholar  for  refer- 
ences which  match  the  term  Volunteered  geographic  information  from  2007 
to  2014  an  interesting  trend  emerges:  from  83  articles  in  2007  to  1,720  in 
2014  (Fig  1)! 

The  growing  volume  of  scientific  production  on  the  topic  cover  multiple 
domains  but  some  major  threads  may  be  identified,  although  intertwined  and 
often  coexisting,  to  build  a narrative  on  the  development  of  VGI.  After  an  ini- 
tial phase  concerned  mostly  with  conceptualizing  and  defining  the  new  phe- 
nomenon (Coleman  2010;  Elwood,2008;  Capineri  & Rondinone  2011;  See  et  al. 
2016;  Sui  et  al.  2012)  and  types  of  participation  (Bonney  et  al.  2009;  Coleman, 
Georgiadou  & Labonte  2009;  Haklay2010;  Haklay2013;  Goodchild  & Li  2012), 
a first  relevant  thread  in  the  literature  is  dedicated  to  the  critical  aspects  of 
quality  (Ali  & Schmidt  2014;  Antoniou  2016;  Foody  et  al.  2013),  among  which 
accuracy  and  precision  of  geo-location  and  of  observations,  completeness  and 
intelligibility  of  contents,  as  well  as  the  reliability  of  information  and  the  trust- 
worthiness of  the  data  source  (Bishr&  Kuhn.  2007;  Bishr  & Janowicz  2010)  and 
at  the  same  time,  the  first  applications  and  experimentations  appear  and  show 
the  use  of  VGI  in  natural  disaster  management  (Zook  et  al.  2010;  Goodchild  & 
Glennon  2010;  Ostermann  & Spinsanti.  2012,  Spinsanti  & Ostermann  2013), 
land  use  (Antoniou  et  al.  2016;  Perger  et  al.  2012),  tourism  (Girardin  et  al. 
2008;  Sun  et  al.  2013;Teobaldi  & Capineri  2014  ),  environmental  monitoring 
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Figure  1:  Articles  on  Google  Scholar  on  Volunteered  geographic  information 
(2007-2015  January). 

Source:  Google  Scholar  (January  2015). 


(Gouveia  & Fonseca  2008;  Connors  et  al.  2012  ),  the  integration  of  VGI  and 
spatial  data  infrastructures  (Craglia  2007;  McDougall  2009),  the  relationship 
with  the  GIS  world  (Kuhn  2007;  Goochild  2016)  and  mapping  (in  particular 
with  reference  to  the  OpenStreetMap  project)  (Neis  & Zipf  2012;  Dodge  & 
Kitchin  2013)  and  finally  ontology  (Tardy  et  al.  2016).  Another  relevant  thread 
addresses  the  role  of  VGI  in  the  inner  worlds  of  geography  such  as  place  (Cor- 
bett 2013;  Purves  & Edwardes  2008;  Hardy  et  al.  2012;  Hecht  & Gergle  2010; 
Ostermann  et  al.  2015;  Purves  & Derungs  2015),  the  definition  of  Vague  places 
like  downtown,  livelihoods  and  vernacular  geography  (Hollenstein  & Purves 
2010);  the  dynamics  of  urban  cores  (Aubrecht  2011;  Jiang  & Jia  201 1;  Sagl  et  al. 
2012)  and  space-time  relationships  (Li,  Goodchild,  Xu,  2013).  More  recently 
analysis  on  the  societal  implications  of  ICT  have  employed  VGI  to  discuss  the 
digital  divide  metaphor  (Elwood  2010;  Elwood  et  al.2012;  Graham  et  al.  2014).2 


2 The  references  mentioned  above  are  only  a small  selection  and  certainly  do  not  pay  justice 
to  all  the  valuable  scholars  involved  in  the  debate  and  research  on  VGI.  Apologies  for  all 
the  authors  that  have  not  been  quoted  despite  being  relevant.  The  contents  of  this  essay  also 
draw  from  the  activities  of  the  COST  Action  IC1203  ENERGIC,  European  Network  Exploring 
research  into  Geospatial  Information  Crowdsourcing,  which  the  authors  belong  to. 
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In  conclusion,  a very  wide  panorama  which  proves  the  broad  penetration 
of  VGI  in  the  scientific  community  and  the  diverse  research  paths. 

The  collection  of  the  papers  in  this  book  touch  on  many  of  these  threads. 


The  Book  Structure 

The  book  includes  peer-reviewed  chapters,  organised  in  six  parts,  which  try  to 
address  some  fundamental  questions:  what  motivates  citizens  to  provide  such 
information  in  the  public  domain,  and  what  factors  govern/predict  its  valid- 
ity? What  methods  might  be  used  to  validate  such  information?  Can  VGI  be 
framed  within  the  larger  domain  of  sensor  networks,  in  which  inert  and  static 
sensors  are  replaced  by,  or  combined  with,  intelligent  and  mobile  humans? 
What  limitations  are  imposed  on  VGI  by  differential  access  to  broadband  Inter- 
net, mobile  phones  and  other  communication  technologies,  and  by  concerns 
over  privacy?  How  do  VGI  and  crowdsourcing  enable  innovation  applications 
to  benefit  human  society? 


PART  I:  Theoretical  and  social  aspects 

Part  I deals  with  the  nature  and  features  of  crowdsourced  geographic  infor- 
mation: the  different  sources  and  typologies  of  VGI,  the  fundamental  aspect 
of  participation.  Capineri  (Chapter  2)  discusses  the  main  features  of  crowd- 
sourced geographic  information  by  focusing  on  the  components  of  such  data: 
the  geographical  reference,  the  contents  and  the  producers  profile  in  order 
to  show  the  potentialities  and  critical  aspects  posed  by  these  sources.  Haklay 
(Chapter  3)  looks  at  participation  inequality  and  explains  how  it  emerges  in 
VGI  and  citizen  science  projects  at  both  temporal  and  spatial  scales,  and  also 
evaluates  its  implication  on  the  use  of  VGI  and  citizen  science  data.  Campagna 
(Chapter  4)  introduces  the  concept  of  Social  Media  Geographic  Information 
(SMGI)  as  a specific  type  of  VGI  for  expressing  pluralism  in  such  domains  as 
spatial  planning  and  governance. 


PART  II:  Quality:  Criteria  and  methodologies 

This  part  covers  issues  relevant  to  the  quality  evaluation  of  VGI  datasets.  The 
evaluation  of  spatial  data  quality  elements  and  the  development  and  adoption 
of  new  quality  criteria  and  methodologies  for  VGI  is  presented  here.  Criscuolo 
et  al.  (Chapter  5)  offer  an  analysis  of  strategies  for  quality  control  and  describe  a 
simple  representation  of  the  components  of  quality  in  crowdsourced  geographic 
information.  Jacobs  (Chapter  6)  explores  methods  of  (semi)automatic  valida- 
tion of  observation  data  especially  in  field  of  citizen  science  projects.  Ballatore 
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(Chapter  7)  provides  an  overview  of  the  semantic  issues  experienced  in  VGI 
and  what  potential  solutions  are  emerging  from  research  in  geo -semantics  and 
in  the  Semantic  Web.  Antoniou  et  al.  (Chapter  8)  present  how  named  features 
in  Open  Street  Map  behave  and  change  in  terms  of  type,  name  and  location  by 
analysing  the  volatility  of  places  and  points-of-interest  (POIs).  Ali  (Chapter  9) 
presents  an  approach  for  rule-guided  classification  for  VGI  projects  which  con- 
sists in  a learning  and  a guiding  phase.  Finally,  Bucher  et  al.  (Chapter  10)  pre- 
sent findings  from  the  literature  and  from  the  experience  of  the  French  National 
Mapping  Agency,  IGN,  about  quality  management  of  geographical  data  and 
focus  on  the  potential  of  context  and  tasks  modelling  to  address  quality  issues 


PART  III:  Data  analytics 

Part  III  focuses  on  data  analytics  and  visualization  methods  for  deriving 
knowledge  from  varying  VGI  sources  such  as  Twitter,  Flickr  and  OpenStreet- 
Map.  Purves  and  Mackaness  (Chapter  11)  provide  an  overview  of  methodolo- 
gies used  to  extract  meaning  from  the  analysis  of  geotagged  images  based  on 
research  in  natural  language  processing  and  statistical  and  exploratory  tech- 
niques. Gennady  and  Natalia  Andrienko  (Chhapter  12)  offer  an  analysis  of  geo- 
graphically referenced  posts  published  in  social  media,  such  as  Twitter,  Flickr 
and  YouTube  and  an  overview  of  visual  analytics  approaches  to  extracting  vari- 
ous kinds  of  information  and  knowledge.  Jiang  (Chapter  13)  proposes  the  head / 
tail  breaks  is  a powerful  tool  for  visualizing  city  structures  and  dynamics,  and 
uses  social  media  location  data  to  illustrate  its  effectiveness  in  visualization. 
Lemmens  et  al.  (Chapter  14)  present  ways  to  enrich  the  unstructured  nature  of 
VGI  through  semantic  enrichment  and  explain  how  folksonomies  and  ontolo- 
gies play  a fundamental  role.  Karagoz  et  al.  (Chapter  15)  focus  on  detecting 
real-world  events  by  following  posts  in  microblogs  by  employing  a process 
for  toponym  recognition  and  location  estimation.  Song  and  Xia  (Chapter  16) 
concentrate  on  spatio-temporal  variation  of  sentiment  polarity  patterns  of 
georeferenced  Tweets,  with  a view  to  understanding  how  opinions  evolve  on 
Twitter  over  space  and  time  and  across  communities  of  users.  Stojanocski  et  al. 
(Chapter  17)  present  a methodology  for  detecting  and  identifying  social  hot- 
spots from  Twitter  stream  data  and  applying  sentiment  analysis  on  the  data  in 
New  York.  Finally,  Steiger  et  al.  (Chapter  18)  provide  a state  of  the  art  survey  on 
social  media  data  analysis  and  mining  from  a GIScience  perspective. 


PART  IV:  VGI  and  crowdsourcing  in  environmental  monitoring 

Part  IV  presents  several  case  studies  where  crowdsourced  geographic  infor- 
mation has  been  applied  especially  in  the  field  of  environmental  monitoring. 
Kettunen  et  al.  (Chapter  19)  describe  several  examples  to  illustrate  the  changing 
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role  of  citizens  in  environmental  monitoring  in  Finland.  Jokar  Arsanjani  and 
Fonte  (Chapter  20)  investigate  the  completeness,  thematic  accuracy  and  fit- 
ness for  use  of  OpenStreetMap  features  for  land  mapping  purposes  in  Europe. 
Lupia  and  Estima  (Chapter  21)  evaluate  the  adequacy  of  the  geotagged  pho- 
tos available  at  Panoramio  for  monitoring  Land  Use/Cover  (LULC)  in  urban 
areas,  taking  Rome  (Italy)  as  the  study  area.  Oltra  et  al.  (Chapter  22)  introduce 
AtrapaelTigre.com,  a citizen  science  project  focusing  on  the  Asian  tiger  mos- 
quito in  Spain,  and  describe  lessons  they  have  learned.  De  Albuquerque  et  al. 
(Chapter  23)  close  the  section  by  reviewing  the  use  of  crowdsourced  geographic 
information  for  disaster  risk  management  and  improving  urban  resilience  and 
suggesting  future  research  directions. 


PART  V:  VGI  in  mobility  applications 

Part  V includes  several  case  studies  and  research  activities  of  VGI  in  smart 
cities  and  mobility  applications.  Zipf  et  al.  (Chapter  24)  investigate  the  use  of 
OpenStreetMap  for  routing  and  navigation  for  mobility-impaired  persons  and 
describe  existing  challenges  in  this  aspect.  Farkas  (Chapter  25)  introduces  a 
crowdsourcing  based  smart  timetable  service  for  public  transportation.  Lendak 
(Chapter  26)  reviews  existing  mobile  crowdsourcing  projects  in  smart  city, 
focusing  on  environmental  monitoring,  citizen  collaboration,  urban  mobility, 
health/fitness  and  social  networking.  Finally,  Stojanovic  et  al.  (Chapter  27)  pre- 
sent a framework  and  some  applications  to  illustrate  how  mobile  crowdsourc- 
ing can  be  used  for  enabling  smart  urban  mobility. 


PART  VI:  VGI  in  spatial  planning 

Part  V includes  a broad  variety  of  case  studies  and  research  activities  of  VGI  in 
spatial  planning.  Huang  and  Gartner  (Chapter  28)  illustrate  how  mobile  crowd- 
sourcing and  social  media  data  can  be  used  to  study  people  s affective  responses 
to  different  environments,  as  well  as  the  potential  applications  of  these  affective 
data.  Massa  and  Campagna  (Chapter  29)  introduce  Spatext,  a tool  that  allows 
integration  of  VGI  and  authoritative  data,  and  present  a case  study  to  illustrate 
its  application  in  urban  planning.  Basiouka  and  Potsiou  (Chapter  30)  inves- 
tigate the  use  of  VGI  and  crowdsourcing  in  Cadastre  design,  and  propose  a 
crowdsourcing  cadastral  model  for  official  cadastral  surveys.  And  in  the  last 
chapter,  Fan  and  Zipf  (Chapter  31)  investigate  the  generation  of  3D  city  models 
by  using  OpenStreetMap  data  and  introduce  the  OSM-3D  project. 

In  addition,  the  book  consolidates  the  references  and  information  from  all  the 
chapters  in  a rich  bibliography  and  a glossary,  thereby  providing  a state-of-the- 
art  of  research  on  crowdsourced  and  volunteered  geographic  information,  as 
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well  as  valuable  reference  for  PhD  candidates  and  senior  researchers  from  many 
disciplines  who  aim  to  tap  into  the  potential  of  new  geographic  data  sources. 


Concluding  Remarks 

As  this  collection  demonstrates,  research  into  crowdsourced  geographic  infor- 
mation or  VGI  (depending  on  the  definition  that  the  researchers  prefer),  has 
made  significant  inroads  in  a relatively  short  period  of  7 or  8 years.  Yet  the  book 
opens  questions  and  points  to  new  research  directions,  in  addition  to  the  find- 
ings that  each  of  the  researchers  demonstrate. 

As  can  be  seen  from  this  book,  the  crowdsourcing  techniques  and  methods 
and  the  VGI  phenomenon  have  motivated  a multidisciplinary  research  commu- 
nity to  identify  both  fields  of  applications  and  quality  criteria  depending  on  the 
use  of  VGI.  Besides  harvesting  tools  and  storage  of  these  data,  many  research 
attentions  have  been  paid  to  these  information  resources,  in  an  age  when  infor- 
mation is  one  of  the  most  important  drivers  of  development.  The  participation 
component  is  a fundamental  aspect  of  crowdsourced  information  and  it  reveals 
both  a new  way  of  doing  science  with  a problem-solving  approach. 

Despite  rapid  progress  in  VGI  research,  this  book  also  shows  that  there  are  tech- 
nical, social,  political  and  methodological  challenges  that  require  further  studies 
and  research.  We  hope  that  the  book  will  spark  new  research  questions  and  devel- 
opment— and  hopefully  foster  new  research  collaborations,  and  friendships. 
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Abstract 

This  contribution  starts  from  the  assumption  that  volunteered  geographic 
information  is  a technological,  cultural  and  scientific  innovation.  It  therefore 
offers  first  some  general  background  on  the  context  that  has  fuelled  the  devel- 
opment of  VGI  and  the  lively  scientific  debates  that  have  accompanied  its  suc- 
cess. The  paper  then  focuses  on  the  nature  of  this  data  by  describing  the  main 
elements  of  VGI:  the  geographical  reference  (coordinates,  geotag,  etc.),  the 
contents  (texts,  images,  etc.)  and  the  producers  profiles.  The  opportunities  and 
the  criticalities  offered  by  this  data  are  described  with  examples  drawn  from 
recent  literature  and  applications  to  highlight  both  the  research  challenges  and 
the  current  state  of  the  subject.  The  chapter  aims  to  provide  a guide  to  and  a 
reference  picture  of  this  rapidly  evolving  subject. 

Keywords 

volunteered  geographic  information,  crowdsourced  information,  geography 

Introduction:  a technological  and  cultural  innovation. 

The  most  cited  and  debated  definition  of  volunteered  geographic  information 
was  coined  in  2007  by  Michael  Goodchild  (2007)  as  a subset  of  user-generated 
content  which  carries  specific  spatial  and  temporal  components: 
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‘the  widespread  engagement  of  large  numbers  of  private  citizens,  often 
with  little  in  the  way  of  formal  qualifications,  in  the  creation  of  geo- 
graphic information,  a function  that  for  centuries  has  been  reserved 
to  official  agencies.  [..]  I term  this  volunteered  geographic  information 
(VGI),  a special  case  of  the  more  general  Web  phenomenon  of  user- 
generated content' (p.  2). 

Since  2007,  the  appeal  of  VGI  has  grown  steadily  and  created  a wide  scien- 
tific community  involved  in  the  harnessing  of  these  new  sources  of  geographi- 
cal information  and  in  satisfying  the  spatial  shift  fuelled  by  the  neogeography 
revolution  which  has  put  mapping  within  the  grasp  of  almost  any  user  who  is 
not  only  in  possession  of  suitable  technology  (e.g.  a smartphone),  but  is  capable 
of  configuring  it  in  order  to  capture  location  and  skilled  enough  to  view  the 
resulting  information  and  share  it  in  space  or  on  a map  (Turner  2006;  Batty 
2010;  Wilson  & Graham  2013). 

All  historical  transformations  in  means  of  communication  have  led  to  a 
redefinition  of  lifestyles,  of  time  and  space  which  are  deeply  connected  in  soci- 
ety: their  meaning,  perceptions  and  manifestations  are  linked  to  social  prac- 
tices and  evolve  throughout  history  and  across  cultures.  The  transformations 
addressed  here  refer  both  to  the  development  of  Web  2.0  technologies  and  the 
diffusion  of  sensors  of  different  types  which  have  profoundly  modified  the  ways 
of  accessing,  producing,  diffusing  and  representing  geographic  information: 

‘ This  is  an  unprecedented  moment  in  human  history:  we  can  now  know  where 
nearly  everything , from  genetic  to  global  levels , is  at  all  times  ” (Sui  & Delyser 
2013:  p.13). 

In  this  sense  VGI  may  be  considered  a significant  innovation  and  as  with 
any  other  innovation  it  combines  technology,  social  practices  and  power  rela- 
tionships. First,  the  technology  relies  on  the  many  location-based  devices  used 
potentially  by  ordinary  citizens  who  become  sensors  and  on  Web  2.0  appli- 
cations which  enable  information  co-creation  (social  media,  photo-sharing 
platforms,  wiki  projects,  etc.)  in  huge  quantities.  Secondly,  the  phenomenon 
of  user-generated  content  is  part  of  a cultural  change  which  very  recently  has 
led  to  the  adoption  of  open  access  and  a collaborative  and  sharing  approach  to 
information  resources.  This  cultural  turn  has  been  defined  as  collective  intel- 
ligence by  the  French  philosopher  Pierre  Levy  (1994)  who  explains  that  ‘the 
collective  intelligence  tries  to  articulate  in  a new  way  the  individual  and  the 
collective  domains  in  a new  space  of  knowledge . This  concept  has  also  been 
discussed  by  Manuel  Castells  (1996,  2008)  who  has  explained  that  in  the 
information  age  there  is  a growing  juxtaposition  of  individualism  and  com- 
munalism:  networks  of  individuals  which  provide  the  basis  for  increasing  our 
sociability  as  individuals.  In  sociological  studies  many  argue  that  the  ‘bond  of 
community  has  been  lost  (Putman  1995)  and  that  there  is  a need  to  improve 
and  rebuild  social  capital;  in  this  sense  the  debate  on  the  contribution  of  social 
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media  to  social  capital  building  is  questionable.  It  is  hard  to  imagine  how  large- 
scale  social  movements  and  community  building  activities  can  be  organized 
effectively  on  the  basis  of  qualitative  practices  alone,  as  the  logistical  challenge 
requires  an  ability  to  plan  schedules,  develop  strategies  and  master  commu- 
nications technologies:  certainly  the  burgeoning  relationships  through  social 
media  betray  new  communication  and  social  practices.  Furthermore,  the  accel- 
erated deterritorialization  process  implies  the  removal  of  barriers  and  limits 
but  also  the  restoration  of  stronger  social  ties. 

Indeed  the  participative  and  collaborative  approach  is  emerging  in  contem- 
porary society  as  Jeremy  Rifkin  (2011)  explains:  people  are  biologically  predis- 
posed to  be  empathic — that  our  core  nature  is  not  rational,  detached,  acquisi- 
tive, aggressive,  and  narcissistic,  but  affectionate,  highly  social,  cooperative, 
and  interdependent.  Homo  sapiens  is  giving  way  to  Homo  empathicus\  Histori- 
ans tell  us  that  empathy  is  the  social  glue  that  allows  increasingly  individualized 
and  diverse  populations  to  forge  bonds  of  solidarity  across  broader  domains  so 
that  society  can  cohere  as  a whole. 

In  this  context  the  crowdsourcing  process  has  emerged  and  has  been  defined 
by  the  journalist  Jeff  Howe  (2008)  as  'the  process  by  which  the  power  of  the  many 
can  be  leveraged  to  accomplish  feats  that  were  once  the  province  of  a specialized 
few\  This  phenomenon  has  been  supported  by  the  so-called  sharing  economy’ 
as  a socio-economic  system  built  on  the  sharing  of  skills,  goods  and  services 
driven  by  the  increasing  sense  of  urgency  of  resource  depletion;  by  the  open 
source  movement  which  - at  least  in  theory  - enables  any  user  to  participate  in 
the  information  society  by  sharing  know-how  and  skills  mediated  by  Web  2.0 
tools  and  applications.  Famous  initiatives  like  Wikipedia,  founded  in  2001  or, 
more  specifically  in  the  realm  of  geography,  OpenStreetMap,  launched  in  2004, 
do  not  need  any  further  explanation  here. 

VGI  is  thus  closely  related  to  the  concept  of  crowdsourcing  as  it  is  an  asser- 
tive method  of  collecting  geospatial  information  from  people  who  are  mainly 
participating  in  Web-based  social  networking  sites,  in  citizen  science  initiatives 
or  in  the  context  of  collaborative  commons-based  peer  production  networks 
(Benkler  & Nissenbaum  2006). 

The  volunteering  element  is  the  subject  of  debate  in  VGI  literature  since 
contributors  may  produce  information  either  consciously  or  unconsciously: 
crowdsourcing  implies  a process  of  consensus  production  whereby  many  peo- 
ple will  provide  and  augment  information  about  the  same  thing  which  will 
become  more  and  more  accurate  thanks  to  a convergence  of  information.  In 
the  case  of  geographic  information  involuntarily  produced  by  individuals, 
quality  might  be  debated  since  data  are  often  collected  publicly  without  strict 
standardization  and  every  user  inserts  data  according  to  his/her  personal  back- 
ground and  point  of  view  (Coleman  2009).  In  fact  Harvey  (2013)  has  suggested 
the  definition  crowdsourced  geographic  information  which  refers  to  data  col- 
lected via  opt-out’  agreements  which  are  more  open-ended  and  offer  fewer 
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opportunities  to  control  the  data  collection,  and  subsequently  its  quality  and 
assessment.  In  contrast,  when  volunteered  information  production  - or  geo- 
graphic volunteer  work  (Priedhorsky  et  al.  2010)  - is  regulated  by  shared  rules 
concerning  the  geocoding,  tagging  and  annotation  of  the  data,  VGI  becomes 
part  of  citizen  science.  Citizen  science  has  emerged  from  the  fields  of  ecology, 
biology  and  nature  conservation,  whose  projects  are  based  on  volunteering  and 
contribution  of  information  on  areas  such  as  biodiversity,  environmental  qual- 
ity or  endangered  species  both  for  the  benefit  of  human  knowledge  and  science 
(Haklay  2013a).  This  is  demonstrated  by  the  countless  applications  in  the  field 
of  biodiversity  monitoring  and  environmental  assessment  (Bonney  et  al  2009; 
Gouveia  et  al.  2008)  as  emerging  problematic  issues  in  the  context  of  sustain- 
able development. 

Citizen  science  re-evaluates  the  separation  between  scientists  and  public,  and 
scientists  need  to  adjust  to  their  new  role  as  mediators  of  knowledge  rather 
than  as  the  sole  repository  of  scientific  truth:  ‘This  might  end  up  being  the  most 
important  outcome  of  citizen  science  as  a whole  as  it  might  eventually  catalyse 
the  education  of  scientists  to  engage  more  fully  with  society’  (Haklay  2013:14). 

In  this  context  VGI  embodies  either  the  implicit  or  explicit  relationship  of  the 
individual  to  the  world  and  represents  the  sense  of  belonging,  in  some  intrinsic 
way,  to  a larger  body,  whether  a nation  or  a neighbourhood;  this  relationship 
has  long  been  a critical  part  both  of  the  individual’s  motivation  to  act  in  some 
larger  interest  and  of  the  group’s  ability  to  exhort  the  individual  to  take  action 
and  participate  (Curry  1997). 

Before  closing  this  introductory  section,  it  is  worth  mentioning  that  litera- 
ture related  to  VGI  has  increased  enormously  in  the  recent  past  (see  the  Intro- 
duction of  this  Handbook)  which  may  lead  us  to  ask  why  VGI  is  so  appealing 
and  why  it  has  created  so  much  scientific  interest  in  geography.  The  main  driv- 
ers of  its  success  certainly  relate  to: 

a)  the  features  of  this  information  (the  non-expert  producers,  the  partici- 
patory approach,  the  huge  quantity,  the  real  time  accessibility,  the  finer- 
grained  resolution  and  the  scalability); 

b)  the  extremely  diversified  fields  of  potential  applications  (disaster  and  cri- 
sis management,  environmental  monitoring,  planning,  land  use,  mobility, 
people’s  behaviour  etc.)  which  are  more  and  more  employed  in  govern- 
ance and  in  the  management  of  public  services; 

d)  the  ‘wow’  component  due  to  unexpected,  creative  and  sometimes  amus- 
ing topics  which  can  be  tackled  spatially  with  these  sources.  This  is  well 
demonstrated  by  the  Floatingsheep  collective  which  started  in  2009  by 
producing  witty  and  entertaining  explorations  such  as  Santa  Claus’s 
homeland,  the  ‘beer  belly’  of  America,  zombies  vs  vampires  and  so  on:  ‘At 
FloatingSheep,  we’re  willing  to  search  for  and  analyse  almost  anything  that 
falls  within  the  realm  of  human  experience.  Sometimes  this  is  mundane 
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(pizza)  and  sometimes  it  is  contentious  (abortion)  but  most  of  the  time  it 
falls  somewhere  in  between.  Such  as,  where  can  I get  a drink?’;1 
e)  the  experiential  and  perceptional  nature  of  the  content  embedded  in  VGI 
which  can  be  distilled  both  to  achieve  a better  understanding  of  beliefs, 
practices  and  habits  and  potentially  challenge  the  dominant  narratives, 
because  VGI  is  built  on  the  understanding  of  the  social  world  mediated 
by  people’s  conversations  and  contributions,  thus  it  consists  of  social  prac- 
tices (Elwood  2008). 


The  components  of  VGI 

Generally  speaking  VGI  consists  of  a ‘big’  and  ongoing  flow  of  data  deriving 
from  different  tools  and  media  (mobile  phones,  cameras,  records  of  smartcard 
transactions,  social  platforms,  check-ins,  location-based  devices,  etc.);  they  are 
digital  footprints,  or  data  shadows,  or  by-products  of  human/machine  inter- 
actions (Graham  2013).  Such  digital  footprints  are  produced  by  anyone  who 
may  potentially  act  as  a sensor  and  provide,  more  or  less  consciously,  valu- 
able information  (Capineri  & Rondinone  2011;  Haklay  et  al.  2008;  Sui  2008)  by 
applying  local  and  sectorial  knowledge  since  producers  are  ‘[...]  equipped  with 
some  working  subset  of  the  five  senses  and  with  the  intelligence  to  compile 
and  interpret  what  they  sense,  and  each  free  to  rove  the  surface  of  the  planet’ 
(Goodchild  2007). 

More  precisely  the  essential  components  of  VGI  are: 

1)  the  geographical  references  (i.e.  geotag,  coordinates,  geographic  name) 
which  enable  the  information  to  be  represented  on  a map  and  thus  satisfy 
the  eternal  human  desire  to  know  ‘where  we  are’  or  ‘where  things  are’; 

2)  the  stock  of  content  which  makes  it  possible  to  transform  this  data  into 
information  and  possibly  knowledge.  The  content  may  take  different  forms: 
images,  texts,  symbols,  maps,  check-ins,  photos,  videos,  drawings,  etc. 

3)  attributes,  of  various  degrees  of  accuracy,  of  content  users  and  content  pro- 
ducers ( produsers , Coleman  et  al.  2009)  (such  as  nationality,  language  and 
possibly  age  and  gender)  and  the  time  of  the  digital  footprint’s  creation. 

The  combination  of  three  components  provides  a powerful  way  to  aggregate, 
synthesize  and  compare  information  at  different  scales  on  specific  issues  or 
events  which  are  occurring  either  in  real  time  or  in  longer  time  spans. 

Thus  VGI  components  may  be  represented  as  the  three  corners  in  a standard 
ternary  diagram  (Figure  1)  to  show  that  employment  of  this  data  may  vary 
according  to  the  emphasis  given  to  one  or  more  of  the  components  or  to  the 
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1.  Where  / Location 

Spatial  perspective 


2.  What  /content 
Place  perspective 


3.  Who  /prosumers 

Participation,  motivation, 
community  of  interests 


Figure  1:  VGI,  a diagram  representation  of  the  components. 


changing  balance  between,  for  example,  ecological,  spatial  or  regional  synthe- 
sis. The  corners  represent  the  components  and  the  related  perspectives:  (1)  the 
where  of  the  information  given  by  the  geographical  reference  which  serves  a 
spatial  perspective,  (2)  the  what  given  by  the  stock  of  content  which  suits  place- 
oriented  and  qualitative  analysis,  (3)  the  who  given  by  the  produsers  which  fits 
well  the  analysis  of  the  participation  and  production  process.  The  barycentres 
position  represents  the  different  composition  of  the  three  VGI  components  and 
it  may  vary  according  to  the  chosen  perspective. 

Of  course  the  representation  is  oversimplified  since  at  least  the  time  dimen- 
sion needs  to  be  added  to  give  a fuller  picture  of  the  potentialities  of  the  rela- 
tionship between  the  components  of  VGI  sources.  Moreover  the  complexity  of 
VGI  data  lies  in  the  fact  that  analysis  needs  to  draw  on  cognitive,  psychological 
and  anthropological  inputs  to  fully  appreciate  and  exploit  this  data,  thus  going 
beyond  simple  representation  on  a map.  The  fragmented  individual -level  con- 
tents from  the  crowd  provide  qualitative  information  which  was  unreachable 
in  the  past  through  traditional  direct  investigations  (i.e.  surveys,  interviews, 
etc.)  or  official  data  (i.e.  census);  the  employment  of  qualitative  information 
is  not  new  in  geography,  as  it  was  the  pillar  of  the  perception  and  behavioural 
approach  (Claval  1974),  but  the  innovative  aspects  are,  in  addition  to  the  quan- 
tity and  the  scale  (from  global  to  local  and  vice  versa),  the  granularity  of  the 
topic  and  the  timeliness  that  VGI  allows.  As  Table  1 shows,  very  specific  events 
or  unexpected  topics,  like  beverage-consumption  habits,  can  be  now  quite  eas- 
ily addressed  and  explored. 

In  the  following  sections  each  of  the  components  will  be  briefly  discussed 
drawing  examples  both  from  existing  literature  and  my  own  research. 
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Event-dependent 

Activity  time 

User  generated 
contents 

Scale 

Habemus  Papam 
(Source:Ladest) 

13-14  March  2013 

10,000  Geocoded 
Tweets 

Global 

Palio  di  Siena 
(SourceiLadest) 

01-02  July  2013 

375  Geocoded 
Tweets 

Urban 

Hurricane  Sandy  in  NYC 
(Source:  Shelton  et  al.  2014) 

24-30  October 

16,000  Geocoded 
Tweets 

Urban 

Topic-oriented 

Church  or  Beer 
(Source:  FloatingSheep) 

22-28  June  2012 

17,686  (church)  + 
14,405  (beer) 

National 

Table  1:  Some  examples  of  event-dependent  or  topic-oriented  applications  in 
VGI. 


The  geographical  reference:  living  space  and  local  knowledge 

The  geographical  component  represents  the  raw  digital  footprint  that  can 
be  represented  in  space  as  the  manifestation  of  the  producers  activity  - or 
inactivity  - on  the  Web.  Although  the  act  of  producing  information  is  to  a 
greater  or  lesser  degree  voluntary,  locating  the  origin  of  this  data  on  the  Earths 
surface  highlights  the  constellations  of  participation  on  the  Web  thanks  to 
geocoding  attributes  (geotags;  geographic  names,  coordinates).  The  footprints 
offer  a preliminary  source  of  information  which  reveal  the  produsers  appro- 
priation of  place  by  naming,  tagging  or  annotating  it. 

The  geographical  component  has  a phenomenological  value  in  itself  since 
it  is  often  a response  to  a stimulus,  either  an  event  or  a simple  desire  to  get  in 
touch  with  friends  and  show  where  you  are  or  where  you  have  been.  Many 
case  studies  show  strong  uneven  geographical  patterns  of  participation  in  the 
production  of  VGI  which  may  be  described  through  the  digital  divide  meta- 
phor (Graham  2014).  The  divide  maybe  caused  by  different  reasons  such  as  the 
uneven  diffusion  of  the  technology  but  also  the  ability  to  work  with  or  benefit 
from  it.  The  distribution  of  geotagged  Wikipedia  articles  (Figure  2)  supports 
this  assumption.  The  articles  on  Wikipedia  are  an  example  of  pure  volunteered 
geographic  information  because  users  have  added  content  in  Wikipedia  delib- 
erately, and  despite  the  pervasiveness  of  the  internet,  recent  analysis  shows  that 
the  practice  of  producing  content  is  mainly  concentrated  in  the  United  States, 
Western  Europe,  Japan  and  in  some  emerging  countries  in  South  America  and 
Asia.  The  uneven  distribution  is  mainly  explained  by  the  traditional  variables 
of  wealth  such  as  population,  GDP  per  capita  and  broadband  internet  connec- 
tions (Graham  et  al.  2014). 
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Figure  2:  Geotagged  Wikipedia  articles  per  country  in  all  languages  (Source: 
Graham  et  al.  2014). 


Figure  3:  Habemus  Papam:  Tweeting  activity  on  the  Pope’s  election  day  (Source: 
Ladest  & Unisi  2013). 


When  dealing  with  crowdsourced  information  derived  from  social  media, 
other  types  of  uneven  distribution  may  emerge.  The  map  (Figure  3)  show- 
ing the  Tweets  sent  about  the  election  of  Pope  Francis  in  March  2013  (10,000 
Tweets  collected  on  13-14  March  2013)  is  an  expression  of  a heterogeneous 
community  of  interest  which  has  spontaneously  responded  to  a particular 
event,  unaware  of  the  fact  that  their  Tweets  might  be  collected  and  analysed: 
here  the  uneven  distribution  appears  smoother  than  Wikipedia’s  divide  due 
to  the  religious  or  political  appeal  that  the  Pope’s  election  may  have  (note  the 
intense  activity  in  the  Arab  Gulf  and  in  Western  Africa). 

This  uneven  participation  pattern  on  the  Pope’s  election  day  requires  further 
investigation  in  order  to  discover  the  reasons  for  such  disparities,  which  could 
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include  critical  aspects  of  religious  beliefs  and  cultural  proximity:  the  Pope’s 
election  map  is  a manifestation  of  an  event-related  social  activity  which  has 
converged  on  the  election  of  Pope  Francis  and  will  soon  fade  away  The  uneven 
participation  in  Wikipedia,  which  is  a long-standing  project,  is  more  concerning 
than  that  described  by  the  Pope’s  election  map  since  it  highlights  a consolidated 
process  of  information  production  which  reproduces  well-known  dichotomies 
(North- South;  developed-less  developed):  so  the  apparent  democratisation  of 
information  and  knowledge  remains  debatable  (Haklay  2013b). 

Nevertheless,  the  great  innovation  is  that  this  data  enables  us  to  locate  spa- 
tially practices  and  topics  (especially  cultural  and  political  ones)  pertaining  to 
people’s  everyday  lives  which  reflect  specific  time-space  configurations  at  differ- 
ent scales  and  with  a degree  of  fine  resolution  which  was  impossible  in  the  past. 

From  a methodological  point  of  view,  the  location  of  the  information  may 
be  questionable  since  georeferencing  creates  many  ambiguities:  (a)  the  same 
name  may  be  used  for  more  than  one  location,  (b)  the  same  location  can  have 
more  than  one  name  and  (c)  place  names  can  be  used  in  non-geographic  con- 
texts such  as  organizations,  events  or  personal  names.  Furthermore,  the  greater 
or  lesser  awareness  of  the  produsers’  in  the  information  generation  process  may 
affect  accuracy  and  quality:  while  prosumers  participating  in  citizen  science 
activities  adopt  and  share  recording  rules,  unaware  citizens  just  use  the  tech- 
nologies for  their  own  purposes  and  do  not  generally  pay  attention  to  the  geo- 
referencing of  the  information,  despite  adding  relevant  content.  For  example,  if 
we  search  for  ‘Chianti’,  the  famous  wine-producing  area  in  Tuscany,  on  Flickr, 
among  the  photos  there  is  one  of  the  bronze  sculpture  of  Perseus  with  the  head 
of  Medusa  by  Benvenuto  Cellini  which  is  located  near  the  Uffizi  in  Florence:2 
in  this  case  the  tag  ‘Chianti’  does  not  enable  us  to  place  the  photo  in  the  correct 
location  but  nevertheless  it  offers  other  hints  relevant  to  a place-based  analysis: 
indeed  ‘Chianti’  is  one  of  the  many  tags  of  this  picture  (Florence,  wine,  holiday, 
Stendhal  Syndrome,  lovely  city,  Arno  river)  which  are  employed  by  the  user  to 
describe  the  spirit  of  his/her  Tuscan  holiday  experience. 


The  stock  of  contents:  place  and  qualitative  information 

The  content  reveals  the  ‘sticky  places’  in  the  fluid  information  flows:  a world 
of  places  of  knowledge  which  not  only  tell  stories  of  VGI  ‘birthplaces’  but  col- 
lect the  added  value  generated  by  the  produsers.  Contents  may  be  either  ‘neu- 
tral/locational’  if  carrying  simply  positional  information  (i.e.  an  address)  or 
descriptive  if  they  take  the  form  of  texts,  comments,  images,  drawings  or  video 
clips.  The  stock  of  contents  records  points  of  view,  values,  feelings,  expres- 
sions of  appreciation  or  contempt,  of  happiness  and  unhappiness;  in  short  they 


2 https://www.flickr.com/photos/75992994@N05/15607974697/in/photolist-pMdX72-pkvq8o- 
6JxVtE-4WVhxs-5W7cNo-78kME8-sjwVKi  [Accessed  April  2016]. 
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represent  the  sense  of  place  engineered  by  the  Web  because  VGI  contributors 
are  engaged  in  knowledge  production  processes  which  are  grounded  in  social 
structures  and  sets  of  values,  and  in  turn,  physical  place  (Hardy  et  al.  2012:  3; 
Lussault  2007). 

From  this  point  of  view,  VGI  content  capitalizes  the  informal  knowledge  of 
the  producers  and  becomes  a collector  of  multiple  identities  and  perceptions 
which  highlight  the  variegated  relationships  with  a certain  place  such  as  inclu- 
sion and  exclusion,  sharing  and  reacting  and  so  on. 

It  is  true  in  fact  that  VGI  content  incorporates  the  situatedness  of  individuals 
and  the  invisible  knowledge  of  the  producer,  the  location  and  the  fluidity  of 
perceptions  (Zook  & Graham  2007). 

The  number  of  content  contributors  combined  with  the  ability  to  annotate 
place  in  the  geoweb  may  result  in  dense  layers  of  information  augmenting  some 
parts  of  the  world  which  describe  ‘the  indeterminate,  unstable,  context  depend- 
ent and  multiple  realities  brought  into  being  through  the  subjective  coming- 
togethers  in  time  and  space  of  material  and  virtual  experience”  (Graham  et.  al. 
2012:  465;  Graham,  Zook  & Boulton  2013).  Several  scholars  (Graham  2010; 
Crang  1996)  use  the  metaphor  of  palimpsests,  with  reference  to  medieval  writ- 
ing blocks  that  could  be  reused  while  maintaining  traces  of  earlier  inscriptions: 
‘the  countless  layers  of  any  place  come  together  in  specific  times  and  spaces  and 
have  bearing  on  the  cultural,  economic,  and  political  characteristics,  interpre- 
tation and  meaning  of  a place”  (Graham  2010:  422). 

For  example,  when  reading  the  English  and  Farsi  versions  of  the  description 
of  the  town  Esfahan  in  Iran  on  Wikipedia,  the  information  is  slightly  different 
both  in  terms  of  the  images  shown  and  of  content:  the  English  version  has  more 
stereotyped  pictures  of  the  towns  blue  and  white  ceramic  decorations  than  the 
Farsi  version,  which  also  contains  descriptions  of  local  artefacts.  In  addition, 
the  nuclear  activity  which  takes  place  close  to  the  city  is  omitted  in  the  Farsi 
version  since  the  topic  is  clearly  regarded  as  a sensitive  one. 

In  this  way  crowdsourced  information  becomes  particularly  relevant  in  the 
production  and  acquisition  of  local  knowledge  either  through  place  names 
or  practices  and  values.  Peoples  contributions  in  VGI  tend  to  be  more  accu- 
rate in  places  the  contributor  knows  best  and  is  nearer  to,  in  accordance  with 
Tobler  s law  which  states  that  ‘everything  is  related  to  everything  else  but  near 
things  are  more  related  than  distant  things  ” (Tobler  1970).  As  such,  some  lit- 
erature hypothesizes  that  (a)  contributors  write  about  nearby  places  more  often 
than  distant  ones  and  that  (b)  this  likelihood  follows  an  exponential  distance 
decay  function  (Hardy  et  al.  2012).  Indeed,  according  to  recent  research,  about 
50  percent  of  Flickr  users  contribute  local  information  on  average,  and  over 
45  percent  of  Flickr  photos  are  local  to  the  photographer  (Hecht  & Gergle 
2010).  Local  knowledge  deriving  from  VGI  has  remarkably  been  applied  to  ver- 
nacular geography,  which  had  been  eroded  by  the  quantitative  approach,  which 
‘encapsulates  the  spatial  knowledge  that  we  use  to  conceptualize  and  commu- 
nicate about  space  on  a day-to-day  basis.  Importantly,  it  deals  with  areas  which 
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are  typically  not  represented  in  formal  administrative  gazetteers  and  which  are 
often  considered  to  be  vague’”  (Hollenstein  & Purves,  2005:  22)  such  as  ‘down- 
town’” (Hollestein  & Purves  2010),  ‘neighbourhood’3  or  regions  like  the  ‘Alps’” 
(Purves  & Derungs  2015)  from  the  tags  associated  with  georeferenced  images 
and  place  marks. 


The  urban  character 

The  settings  of  VGI  are  mainly  urban:  most  of  the  crowdsourced  information  - 
especially  if  derived  from  social  media  platforms  - is  produced  in  urban  areas 
which  combine  connection  facilities  (internet,  free  WiFi,  hotspots  etc.)  and  the 
concentrated  critical  mass  of  city  users  (residents,  tourists,  business  people, 
commuters,  students,  visitors,  etc.). 

A recent  study  (Hecht  & Stephens  2014)  reveals  that  in  the  US  there  are  3.5 
times  more  Twitter  users  per  capita  in  core  urban  counties  than  rural  counties; 
the  same  authors  have  discovered  that  urban  users  Tweet  more  than  their  rural 
counterparts. 

The  following  image  (Figure  4)  shows  the  relationship  between  Twitter  activ- 
ity at  night  and  large  urban  areas  in  Italy  (2013):  52%  of  the  georeferenced 
Tweets  (12,000  collected  in  one  night)  fall  within  the  boundaries  of  the  Italian 
large  urban  zones  (LUZ),  as  defined  by  EUROSTAT;  the  percentage  would  be 
higher  if  the  peri-urban  areas  were  included. 

VGI  has  been  used  in  interesting  ways  in  spatial  analysis  of  cities’  urban 
structures.  In  fact,  recent  research  has  identified  ‘natural  cities’  as  human  set- 
tlements, or  human  activities  in  general  on  the  Earth’s  surface,  that  are  delin- 
eated from  massive  geographic  information  derived  from  geocoded  social 
media  data  (Jiang  & Liu  2012;  Jiang  in  this  book).  The  interesting  aspect  of  this 
approach  is  that  the  employment  of  crowdsourced  information  allows  us  to  see 
how  cities  evolve  and  change  over  time,  even  if  the  changes  may  simply  refer  to 
the  espace  vegu  (living  space)  by  city  users.  Here  is  an  example  of  Jiang’s  meth- 
odology applied  to  Florence  (Italy)  which  shows  the  concentration  changes  of 
geocoded  Tweets  for  six  months  (May-October  2013). 

VGI  has  contributed  to  the  discovery  of  other  urban  issues.  For  example, 
analysis  of  geolocated  Flickr  photos  has  identified  the  most  attractive  spots  or 
intra-urban  tourist  routes  (Girardin  et  al.  2008)  and  discovered  that  foreigners 
privilege  stereotyped  places  mainly  concentrated  in  the  city  centre  or  at  trans- 
port nodes  (airports,  stations)  while  local  people’s  gaze  seems  to  fall  on  less 
central  locations  (Crandall  et  al.  2009;  Straumann  et  al.  2014). 

If  the  annotations  (comments,  texts)  of  crowdsourced  data  are  taken  into 
account,  narratives  about  urban  settings  may  be  constructed  and  may  vary 
according  to  whether  they  are  produced  by  insiders  or  outsiders.  The  following 


3 See  http://livehoods.org/. 
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table  (Table  2)  attempts  to  show  the  different  qualitative  narratives  about  Lon- 
dons attractions  which  emerge  from  different  VGI  sources.  The  table  has  been 
created  by  selecting  the  10  most  cited  attractions  on  Trip  Advisor  and  the  10 
most  frequently  used  hashtags  on  the  Twitter  account  @Londonist;  it  high- 
lights the  sites  mentioned,  the  related  attributes  and  the  meaning  which  can 
be  ascribed.  TripAdvisor  s posts  are  generally  created  by  outsiders  (or  tour- 
ists) and  highlight  the  persistence  of  global  imagery  and  the  grandeur  ideal 
of  certain  monuments  which  are  at  the  base  of  a model  of  collective  knowl- 
edge which  endlessly  reproduces  itself  over  time  (Raffestin  1988).  In  contrast, 
the  most  used  hashtags  from  the  profile  @Londonist  may  be  considered  the 


Figure  4:  A comparison  between  Tweets  sent  at  night  and  large  urban  zones  in 
Italy  (Source:  Ladest  2016). 
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Figure  5:  The  urban  espace  vegu  in  Florence  (May- October  2013). 
Source:  Ladest  Lab  2013. 
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Source 

Sites 

Comments  / 
Attributes 

Meaning  /process 

Trip  Advisor 

Tower  of  London, 

Big  Ben,  Tower 

Bridge,  St.  Paul’s 
Cathedral,  London  Eye, 
Buckingham  Palace, 
Trafalgar  Square, 
Piccadilly  Circus 

Wonder,  uniqueness, 
grandeur,  beauty, 
excitement,  surprise 

Tourist  place 
commodification 

Global  imagery 

Covent  Garden, 

Camden  market 

Curious,  unusual 

@Londonist 

Artistic  and  folk  events 
Suburban 
neighborhoods 
(i.e. Walthamstow  in 

East  London) 

Wimbledon 

River  Thames 

Parks 

Silence 

Peaceful 

Relaxing 

Entertaining 

Cheap 

Environmental 

quality 

Urban  free  time 
Local  community 
life 

Table  2:  Qualitative  crowdsourced  information  about  the  London  urban  area. 


manifestation  of  insiders  preferences:  they  reveal  Londoners’  desire  for  more 
intimate  and  less  crowded  places  and  the  quest  for  sites  where  they  can  find 
better  environmental  quality,  social  ties  and  green  areas. 


Conclusions 

Crowdsourced  information,  namely  volunteered  geographic  information  (VGI), 
is  a revolutionary  source  of  information  for  increasing  spatial  and  behavioural 
knowledge  on  different  topics  or  phenomena  in  contemporary  and  everyday  life 
from  politics  to  the  environment,  from  cultural  events  to  natural  disasters  and 
more.  The  advances  in  geospatial  technologies  in  the  past  twenty  years  have  ena- 
bled ordinary  citizens  with  little  formal  training  to  participate  in  the  production 
of  geographic  data  and  knowledge  through  diverse  forms  of  user-generated  con- 
tent and  VGI:  everyday  activities  may  be  transformed  into  creative  expressions 
that  can  be  uploaded,  modified  and  shared  in  the  digital  world  (Sui  & Delyser 
2012;  Parks  2001).  The  multiple  identities  of  VGI  have  been  described  as  hybrid 
geographies  which  consider  creative  connections  within  geographies  - physi- 
cal and  human,  critical  and  analytical,  qualitative  and  quantitative  - aiming  to 
integrate  perspectives  on  place,  revealing  interactions  and  society  at  large  (Sui  & 
DeLyser  2012:  112).  From  a spatial  perspective,  VGI  is  an  attempt  to  break  free 
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from  institutional  boundaries  (municipalities,  regions,  provinces,  counties,  etc.) 
as  shown  by  the  application  of  the  long  tail  model  in  urban  development  based 
on  VGI  data  (Jiang  2014)  or  the  employment  of  clustering  methodologies  based 
on  geolocated  crowdsourced  data  which  produce  areas  of  diffusion  of  interests, 
emotions  and  conflicts.  From  a place  perspective,  places  - or  more  traditionally 
regions  - identified  are  based  on  people  who  are  at  or  experiencing  a certain 
place  and  deliver  different  type  of  information  which  capture  social  practices 
and  ongoing  processes,  either  peaceful  or  conflictual. 

As  with  any  other  scientific  advance,  VGI  provides  new  food  for  thought 
and  has  raised  many  epistemological  questions  in  the  field  of  geography 
(Kit chin  2013,  2014)  which  have  led  to  arguable  definitions  stating  that  data 
can  speak  for  itself  and  that  theory  is  no  longer  needed  (Anderson  2008). 
But  crowdsourced  information  is  not  just  facts  devoid  of  context:  it  may  pro- 
vide large  quantities  of  geographic  information  which  need  to  be  distilled  and 
exploited  within  clear  theoretical  frameworks  (Kitchin  2013;  Sui  & Delyser 
2012).  Similarly  to  what  happened  in  the  19th-century  when  geographer- 
explorers  needed  precise  measuring  instruments  and  binoculars  to  record 
their  observations  of  the  new  lands  and  the  resultant  collaboration  with  natu- 
ralists, surveyors  and  biologists,  nowadays,  the  geographer  working  with  VGI 
data  needs  both  the  computing  expertise  to  scrape  and  organise  data  from  the 
Web  and  the  cognitive  tools  to  reach  the  inner  meaning  of  this  information. 
In  this  scenario  new  alliances  have  emerged  between  geography  and  comput- 
ing and  the  cognitive  sciences:  the  tools  for  making  good  use  of  VGI  he  in 
the  methodologies  both  for  geolocating  the  information  and  for  qualitatively 
analysing  the  contents  around  which  discourses  and  narratives  can  be  built 
on  different  scales. 

Finally,  we  should  praise  VGI  and  its  capacity  to  deal  both  with  both  every 
day  and  more  fundamental  topics  that  were  unreachable  in  the  past  in  a timely 
manner  and  with  the  heterogeneity  of  social  phenomena  by  locating  them 
on  the  Earths  surface.  There  is  still  a great  deal  to  do  to  make  sense  of  these 
distributions  but  undoubtedly  the  pulse  of  life  with  all  its  contradictions  and 
inequalities  can  be  grasped  through  a kind  of  information  that,  although  it  is 
currently  concentrated  in  certain  countries  and  areas,  is  likely  to  grow  rather 
than  shrink  and  hopefully  to  become  ever  more  inclusive. 
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Abstract 

Participation  inequality  - the  phenomenon  that  a very  small  percentage  of  par- 
ticipants contribute  a very  significant  proportion  of  information  to  the  total 
output  - is  persistent  across  Volunteered  Geographic  Information  (VGI)  and 
citizen  science  projects.  It  has  been  identified  in  both  online  and  offline  pro- 
jects that  rely  on  volunteers  effort  over  the  past  20  years  and,  therefore,  can  be 
expected  to  appear  in  new  projects.  This  chapter  looks  at  participation  inequal- 
ity (also  known  as  the  1%  rule  or  the  90-9-1  rule),  its  origins  and  some  of  its 
characteristics.  The  chapter  also  explains  how  participation  inequality  emerges 
in  a project  at  both  temporal  and  spatial  scales,  and  also  evaluates  its  implica- 
tion on  the  use  of  VGI  and  citizen  science  data.  The  chapter  suggests  a generic 
rule  for  analysts  of  VGI  and  citizen  science  datasets,  in  the  form:  c When  using 
and  analysing  crowdsourced  information , consider  the  implications  of  participa- 
tion inequality  on  the  data  and  take  them  into  account  in  the  analysis' 
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Introduction 

One  of  the  most  persistent  aspects  that  can  be  noted  in  systems  which  facili- 
tate user-generated  content  (among  them  volunteered  geographic  information 
and  citizen  science  data)  is  the  inequality  in  the  level  of  participation  that  they 
exhibit.  According  to  Jakob  Nielsen  (2006),  participation  inequality  was  first 
recognised  by  Hill  and  his  team  (1992)  while  studying  the  development  of  digi- 
tal documents  and  analysing  the  contributions  by  different  people  to  the  final 
product.  It  manifests  itself  in  online  forums  such  as  mailing  lists,  discussion 
forums,  games  and  ecological  observations  (e.g.  Hill  et  al.  1992;  Mooney  & Cor- 
coran 2012;  Lund  et  al.  2011;  van  Mierlo  2014;  Silvertown  et  al.  2015).  In  each 
of  these  cases,  the  overwhelming  majority  of  people  who  use  the  information  or 
are  registered  to  the  service  do  not  contribute  any  information  to  it.  The  propor- 
tion of  registered  people  who  do  not  contribute  can  reach  90%  or  even  more  of 
the  total  number  of  users.  Of  the  remaining  participants,  the  vast  majority  con- 
tribute infrequently  or  fairly  little  - these  account  for  9%  or  more  of  the  users. 
Finally,  the  last  1%  contribute  most  of  the  information.  This  has  led  to  framing 
the  phenomenon  as  the  90-9-1  rule  (Nielsen  2006).  However,  participation  can 
be  very  skewed.  As  Nielsen  demonstrates,  in  Wikipedia,  0.003%  of  users  con- 
tribute two-thirds  of  the  content,  with  a further  0.2%  contributing  infrequently, 
making  the  relationship  99.8-0.2-0.003%  (with  the  increased  use  of  Wikipe- 
dia since  2006,  the  situation  has  worsened).  There  is  some  evidence  to  suggest 
that  the  proportion  can  be  different  - for  example,  Budhathoki  (2010)  suggests 
that  in  OpenStreetMap  the  proportions  are  70-29.9-0.01%.  Recent  analysis  by 
Harry  Wood  (2014)  provides  an  indication  of  this  relationships  (Figure  1),  with 
the  contribution  of  the  first  ranked  1,000  participants  dwarfing  the  effort  of  all 
other  contributors,  and  only  about  300,000  participants  contributing  more  than 
10  points  of  data  - although  at  the  time  there  were  2 million  registered  users. 

Participation  inequality  has  been  observed  in  VGI  and  citizen  science  pro- 
jects such  as  OpenStreetMap  (Budhathoki  2010;  Mooney  & Corcoran  2012; 
Neis  & Zipf  2012),  Galaxy  Zoo  (Ponciano  & Brasileiro  2014)  and  bird  watching 
(Cooper  & Smith  2010).  It  is  especially  noteworthy  that  participation  inequal- 
ity is  not  only  appearing  in  online  projects,  but  also  can  be  observed  in  projects 
that  mainly  happen  offline,  such  as  participation  in  environmental  volunteer- 
ing or  when  analysing  the  levels  of  contribution  of  different  volunteers  in  bio- 
logical observations  across  London. 

In  this  chapter,  we  look  at  the  implications  of  participation  inequality  and 
argue  that  it  is  among  the  most  significant  aspects  of  VGI  and  citizen  science. 
We  start  by  noticing  what  we  already  know  about  participation  inequality  and 
its  manifestations.  This  is  followed  by  suggesting  possible  explanations  for 
how  it  occurs  and  evolves  over  time.  The  fourth  section  discusses  the  potential 
implications  on  project  development  and  the  use  of  information  that  emerges 
from  it.  We  conclude  with  open  research  questions  and  future  directions  for 
investigation  that  are  of  specific  interest  to  researchers  of  VGI. 
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Figure  1:  OpenStreetMap  contributions  (Wood  2014). 
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Throughout  the  chapter,  OpenStreetMap  is  being  used  to  demonstrate  the 
nature  and  implications  of  participation  inequality.  While  OpenStreetMap  have 
specific  characteristics  in  terms  of  participants  profiles  and  social  dynamics 
(Budhathoki  2010;  Haklay  2010),  it  can  be  used  to  illustrate  the  general  aspects 
of  the  phenomena.  Other  projects  are  being  used  to  augment  the  picture. 


Participation  inequality  - what  do  we  know? 

Unlike  command  and  control  processes  that  are  common  in  industrial  informa- 
tion creation,  VGI  and  citizen  science  are  produced  through  a distributed,  less 
coordinated  system.  Within  industrial  processes,  there  is  scope  for  planning  of 
coverage  and  allocation  of  resources.  For  example,  when  planning  the  surveying 
of  a city  in  an  industrial  process,  it  is  possible  to  divide  the  efforts  of  the  survey- 
ors to  ensure  uniform  level  of  coverage  and  time  allocation  to  different  parts  in 
proportion  to  the  amount  of  work  that  is  required.  Of  course,  the  abilities  of  the 
different  surveyors  will  have  an  impact  on  the  final  results  but,  in  general,  these 
can  be  minimised  through  quality  assurance  so  the  final  product  is  uniform. 

Within  a system  that  relies  on  crowdsourcing  - the  use  of  a large  group 
of  people  with  whom  there  are  no  direct  employment  relationships  - there  is 
far  less  ability  to  dictate  to  the  participants  where,  when  and  how  they  should 
contribute  information.  For  example,  in  a system  that  provides  traffic  informa- 
tion on  the  basis  of  users’  satellite  navigation  devices,  there  is  a co -dependence 
between  the  number  of  users  in  a given  location  and  the  ability  to  provide 
information  about  this  place.  Moreover,  because  the  devices  are  used  within  the 
context  of  daily  activities,  such  as  the  school  run  or  a trip  to  the  local  supermar- 
ket, there  will  be  more  information  about  places  in  which  many  people  travel 
daily  (e.g.  city  centre)  and  especially  during  rush  hour.  While  both  industrial 
and  crowdsourced  systems  are  socio-technical  systems,  in  the  latter  the  ‘socio’ 
requires  special  attention,  particularly  to  the  way  it  influences  the  resulting 
information  that  emerges  from  the  system. 

In  the  case  of  participation  inequality,  since  it  has  been  so  persistent  over 
the  years,  it  is  highly  likely  to  appear  in  any  crowdsourcing  project.  It  has  been 
observed  from  the  pre-Web  internet  messaging  system  Usenet  (Whittaker  et 
al.  1998)  to  current  large-scale  online  citizen  science  (Ponciano  & Brasileiro 
2014).  It  is,  therefore,  part  and  parcel  of  VGI  and  citizen  science. 

Just  as  interesting  is  that  the  phenomenon  repeats  itself  at  various  scales 
(something  akin  to  Power  Laws),  so  analysing  the  level  of  participation  in 
OpenStreetMap  for  the  area  of  London,  Europe  or  across  the  world  will  show 
participation  inequality  (Haklay  2010;  Mooney  & Corcoran  2012;  Neis  & Zipf 
2012).  Participation  inequality  also  occurs  at  different  temporal  scales  of  weeks, 
months  or  years  (Neis  & Zipf  2012).  As  can  be  expected  with  statistical  analysis 
of  this  sort,  the  larger  the  area  or  the  longer  the  time  frame,  the  clearer  the  pat- 
tern and  the  position  of  various  participants. 
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Another  important  aspect  known  about  participation  inequality  is  that  low- 
ering the  barrier  for  participation  does  help,  but  to  a limited  extent.  Even  in  vol- 
unteer computing  projects,  in  which  participants  download  software  to  their 
computers  that  utilises  unused  processing  resources  for  scientific  research,  par- 
ticipation inequality  persists.  IBM  World  Community  Grid  serves  as  an  exam- 
ple. This  project  is  an  aggregator  of  volunteer  computing  projects,  and  yet  few 
members  contributed  most  of  the  processing.  Of  the  350,000  participants,  the 
top  contributor  has  contributed  325  times  more  than  the  250th  contributor, 
and  875  times  more  than  the  1,000th  contributor. 

The  use  of  a leader  board  and  providing  credits  to  emphasise  the  position  of 
participants  has  been  shown  to  encourage  competition  among  contributors, 
but  with  a potential  to  alienate  some  participants  and  reduce  their  motivation 
(Massung  et  al.  2013).  The  assumption  that  it  is  always  valuable  to  encourage 
competition  among  participants  to  yield  more  information  should  be  ques- 
tioned, and  there  are  alternative,  such  as  the  mechanism  that  encourage  col- 
laboration that  Silvertwon  et  al.  (2015)  offer. 

Participation  inequality  also  manifests  itself  through  geographic  and  tem- 
poral patterns.  Thus,  places  that  are  within  the  coverage  area  of  highly  active 
participants  will  have  more  contributions  than  areas  that  do  not  have  many 
participants.  More  generally,  the  geographic  distribution  of  information  shows 
that  some  places  are  more  popular  and  receive  much  more  attention  than  oth- 
ers. Similarly,  the  temporal  pattern  of  highly  active  contributors  has  a dispro- 
portionate impact  on  the  temporal  patterns  of  data  collection  activities  as  a 
whole.  Thus,  the  sleeping  and  working  patterns  that  can  be  observed  within  the 
contributed  information  will  be  influenced  by  the  practices  of  high  contribu- 
tors (Yasseri  et  al.  2013). 

Finally,  while  high  contributors  receive  a lot  of  attention,  in  comparison  to 
the  very  large  group  of  people  who  contribute  very  little  both  individually  and 
to  the  overall  size  of  the  dataset,  we  should  not  forget  that  they  are,  statistically, 
outliers.  They  are  not  representative  of  the  overall  population,  nor  should  we 
expect  them  to  be  so.  There  is  a need  to  have  the  majority  of  people  as  consum- 
ers of  information,  as  otherwise  the  producers  would  lose  the  raison  d'etre  to 
create  and  share  information. 


How  participation  inequality  evolves  over  time  and  space 

One  of  the  puzzling  questions  regarding  participation  inequality  is  how  it 
evolves.  After  all,  at  first  look  the  participants  are  acting  as  volunteers  and 
therefore  there  is  no  limitation  on  the  number  of  people  who  can  join  a spe- 
cific activity  in  citizen  science  or  VGI  or  how  much  each  of  them  contributes. 
Second,  arguably,  the  actions  of  one  participant  do  not  stop  another,  for  exam- 
ple when  viewing  the  same  bird  or  taking  a geotagged  picture  of  Big  Ben  (see 
Jayaraman  2012).  Furthermore,  the  participants  are  only  loosely  coordinated 
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and  therefore  not  necessarily  aware  of  the  actions  of  other  participants,  and 
there  is  no  reason  for  one  to  compete  with  another  or  even  be  aware  of  their 
contribution.  However,  some  of  these  observations  are  inaccurate,  and  a fur- 
ther analysis  of  the  process  that  created  participation  inequality  can  explain  the 
source  of  the  observed  patterns. 

Firstly,  we  can  start  by  noticing  that,  in  many  VGI  and  citizen  science  pro- 
jects, some  resource  is  finite.  For  example,  in  OpenStreetMap  or  Wikimapia, 
once  a participant  has  tagged  a location  and  mapped  it,  this  specific  place  is 
no  longer  available  to  other  users  to  carry  out  the  mapping.  This  is  also  true  in 
volunteer  thinking  projects  in  which  participants  help  scientists  in  classifying 
information  online.  In  such  projects,  the  system  allocates  the  images  to  partici- 
pants and,  after  the  image  has  been  viewed  by  a given  number  of  participants, 
it  is  not  shown  anymore.  Therefore,  if  one  participant  becomes  highly  active, 
they  reduce  the  amount  of  work  that  is  left  to  other  participants  to  carry  out. 

Secondly,  the  temporal  aspects  of  the  project  also  play  their  part  in  generating 
participation  inequality.  For  example,  participants  who  joined  OpenStreetMap 
early  on  were  facing  an  empty  map,  in  which  it  was  relatively  easy  to  identify 
and  digitise  objects  such  as  motorways.  Over  time,  the  ability  to  digitise  objects 
rapidly  diminished  as  the  map  became  complete.  For  a volunteer  who  joins  the 
mapping  process  today,  in  many  places  the  effort  that  is  left  requires  adding 
more  intricate  details  of  building  or  address  information.  This  is  also  true  in 
citizen  science,  for  example  in  the  British  Trust  of  Ornithology  (BTO)  breeding 
bird  survey  which  started  over  twenty  years  ago.  A volunteer  that  joined  the 
project  in  the  early  stages  will  have  collected  many  more  records  over  the  years 
than  a person  that  will  join  the  project  today,  who  will  not  be  able  to  catch  up 
to  such  levels  of  recording. 

Thirdly,  another  side  to  the  temporal  aspect  is  demonstrating  the  link 
between  participation  inequality  and  other  social  inequalities.  The  contribu- 
tions of  participants  can  be  translated  into  time  - for  example,  one  of  top  con- 
tributor to  OpenStreetMap  in  June  2015  (Wladyslaw  Komorek)  edited  over 
4.94  million  objects  in  966  active  mapping  days  over  3 years,  contributing  on 
average  about  5,100  points  in  an  active  day.  With  an  assumption  that  it  is  pos- 
sible to  record  2 objects  per  second,  this  represents  an  average  investment  of 
about  hour  in  digitising  only  (without  any  breaks).  This  is,  of  course,  a low 
estimation,  since  such  a participant  spends  time  on  mailing  lists,  meetings  and 
going  out  mapping.  When  considering  that,  across  advanced  economies,  peo- 
ple have  about  36.5  hours  of  leisure  a week  (OECD  2009),  it  is  clear  that,  for  this 
participant,  OpenStreetMap  is  the  most  important  leisure  activity  during  that 
period.  However,  since  leisure  time  is  more  available  to  men,  and  is  reduced  in 
people  with  major  caring  responsibilities,  it  is  more  likely  that  men  with  a well- 
paid  job  will  be  able  to  become  major  contributors  of  VGI  and  in  many  citizen 
science  activities.  Indeed,  many  projects  have  people  with  such  profiles  as  their 
top  contributors  (e.g.  Cooper  & Smith  2010).  However,  one  should  be  careful 
of  sweeping  generalisations  about  the  profile  of  top  contributors,  as  they  are 
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specific  to  projects  and  research  area  - for  example  in  the  EyeWire  project,  in 
which  participants  help  brain  research  by  analysing  the  structure  of  neurons, 
65%  of  top  contributors  are  women  (Kim  et  al.  2014),  while  bird  watching  is 
dominated  by  men  (Cooper  & Smith  2010). 

Fourthly,  access  to  financial  resources  can  have  an  impact  on  the  ability  of 
people  to  become  high  contributors.  For  example,  an  Australian  study  of  bird- 
watchers concluded  that  some  of  them  travel  300  to  1,900  km  from  home  to 
record  an  observation  (Tulloch  & Szabo  2012).  Such  extensive  travel,  apart 
from  dedication,  also  requires  financial  resources.  Other  VGI  and  citizen  sci- 
ence activities  also  involve  purchasing  specialised  equipment  and  dedicating 
time  to  learn  how  to  use  it. 

Finally,  there  is  a need  to  consider  internal  and  external  motivations  of  high 
contributors.  The  various  studies  that  were  mentioned  above,  and  others,  dem- 
onstrate clearly  that  the  top  contributors  represent  a different  demographic 
group  - for  example,  in  EyeWire  they  are  older  than  the  average  participant 
(Kim  et  al.  2014).  Studies  show  that  their  internal  and  external  motivations  play 
an  important  part  in  maintaining  their  engagement  with  a project.  For  some 
participants,  competition  is  a significant  motivation  (Massung  et  al.  2013)  while 
for  others  the  joint  contribution  to  science  is  a major  one  (Nov  et  al.  201 1). 


The  implications  of  participation  inequality 

Based  on  the  analysis  above,  we  can  formulate  a general  rule  for  crowdsourced 
geographic  information:  c When  using  and  analysing  crowdsourced  information , 
consider  the  implications  of  participation  inequality  on  the  data  and  take  them 
into  account  in  the  analysis ! 

As  we  have  seen,  crowdsourced  information,  either  VGI  or  citizen  science, 
is  created  through  a socio-technical  process,  which,  by  necessity,  will  have 
impacts  on  the  final  outputs.  Yet,  all  too  often  it  is  easy  to  forget  the  social 
side  - especially  when  using  the  information  without  paying  due  attention  to 
the  metadata  of  who  collected  it  and  when.  Even  though  analysts  who  use  the 
information  are  aware  that  the  data  source  is  expected  to  be  heterogeneous 
because  of  the  nature  of  the  crowdsourced  process,  it  is  easy  to  forget  participa- 
tion inequality  and  treat  each  observation  as  similar  to  other  observations  and 
assume  they  were  all  produced  in  a similar  way. 

Yet,  data  is  not  only  heterogeneous  in  terms  of  consistency  and  coverage; 
it  is  also  highly  heterogeneous  in  terms  of  contribution,  which  can  have  far- 
reaching  implications  on  quality,  coverage  and  content.  As  we  have  explored, 
the  outcome  is  dependent  on  the  expertise  of  heavy  contributors,  their  spatial 
and  temporal  engagement,  and  even  on  their  social  interactions  and  conduct. 

For  example,  some  of  the  top  contributors  of  OpenStreetMap  naturally  con- 
centrate their  effort  in  the  city  where  they  live.  Knowing  where  these  individ- 
uals are  active  can  help  in  quality  assurance  processes  by  comparing  novice 
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practices  to  their  actions,  potentially  changing  the  number  of  people  that  are 
required  to  map  an  area  well  (Haklay  et  al.  2010).  In  some  projects,  such  as 
iSpot  (Silvertown  et  al.  2015),  in  which  participants  help  in  the  identification  of 
a range  of  species,  there  are  mechanisms  to  reward  high  contributors  with  trust 
marks  and  to  give  their  opinions  more  weight  during  the  identification  process. 

Another  aspect  of  the  impact  of  high  contributors  is  the  social  evolution  of 
the  project.  In  some  projects,  high  contributors  might  exhibit  abrasive  behav- 
iour towards  other  participants  or  protect  ‘their  patch’  (the  area  in  which  they 
operate)  by  aggressively  editing  any  new  information  to  fit  their  standards. 
Such  conduct  is  not  welcoming  to  new  participants,  and  can  impact  on  the 
growth  of  the  project  and  even  its  resilience  in  cases  where  the  high  contributor 
leaves  the  project. 

The  specific  background  and  interests  of  high  contributors  will,  by  necessity, 
impact  on  the  type  of  data  that  is  recorded.  This  is  especially  important  in  VGI 
projects  where  the  details  of  what  to  record  are  left  to  the  participants.  For 
example,  lack  of  interest  in  a class  of  facilities  (e.g.  wheelchair  accessible  toilets) 
will  mean  that  such  information  will  be  lacking  from  the  resulting  dataset  and 
might  shape  the  activities  of  other  participants  (Stephens  2013). 

Interestingly,  while  some  research  analysed  the  biases  that  are  created  by  high 
contributors  (Haklay  2010;  Begin  et  al.  2013;  Mooney  2013),  there  is  relative 
lack  of  attention  within  the  VGI  literature  to  the  wider  impact  that  they  have 
on  the  information  and  on  other  participants. 


Conclusion 

In  this  chapter,  we  looked  at  participation  inequality  and  its  implications  on 
VGI  and  citizen  science  datasets.  We  have  seen  that  participation  inequality  - 
the  phenomenon  in  which  a very  small  percentage  of  participants  contributes  a 
very  significant  proportion  of  information  to  the  total  outcome  - is  persistent. 
It  occurs  across  spatial  and  temporal  scales  and  is  driven  by  multiple  factors. 

Participation  inequality  impacts  on  the  social  and  technical  outcomes  of  a 
project  and,  because  of  that,  it  is  critical  to  remember  the  impact  and  implica- 
tions of  participation  inequality  during  the  analysis  and  use  of  the  information. 
There  will  be  some  analysis  to  which  it  will  have  less  impact  and  some  where 
it  will  have  major  impact.  In  either  case,  it  needs  to  be  taken  into  account.  This 
can  be  done  by  including  an  analysis  of  participation  patterns  early  on  in  the 
analysis  of  a dataset,  and  examining  the  biases  that  are  caused  by  it. 

While  we  can  expect  it,  we  do  need  to  understand  more  about  the  process 
that  created  it  and  its  impact  on  the  resulting  datasets.  There  is  plenty  of  scope 
for  spatio-temporal  analysis  to  identify  the  actions  of  high  contributors  from 
their  early  actions,  and  evaluate  to  what  degree  they  impact  on  other  contribu- 
tors. There  is  also  value  in  more  detailed  analysis  of  how  people  at  different  lev- 
els of  contribution  add  to  the  project  and  whether  there  are  ways  to  encourage 
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people  to  move  between  contribution  groups.  Finally,  the  ethical  and  practical 
implications  of  high  contributors  should  be  assessed,  especially  in  commercial 
VGI  projects. 
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Abstract 

This  contribution  introduces  the  concept  of  Social  Media  Geographic  Informa- 
tion (SMGI)  as  a specific  type  of  Volunteered  Geographic  Information  (VGI). 
Unlike  other  kind  of  VGI,  which  may  originate  from  geographic  measurements 
crowdsourcing,  SMGI  brings  in  addition  a special  potential  for  it  may  express 
community  perceptions,  interests,  needs,  and  behaviors.  Hence,  SMGI  may 
represent  an  unprecedented  resource  for  expressing  pluralism  in  such  domains 
as  spatial  planning,  where  it  may  convey  the  community  collective  preferences 
contributing  to  enrich  knowledge  able  to  inform  design  and  decision  making. 
In  the  light  of  these  assumptions,  the  main  issues  relevant  for  SMGI  collec- 
tion and  analytics  are  presented  from  the  perspective  of  the  spatial  planning 
and  governance  domain,  and  a framework  for  the  SMGI  analytics  in  planning, 
design,  and  decision  making  is  proposed. 
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Introduction 

It  is  often  assumed,  as  a proxy  benchmark  for  the  growing  amount  of  informa- 
tion being  produced,  that  in  2009  more  data  were  generated  by  individuals  than 
in  the  entire  history  of  mankind  through  to  2008.  In  this  avalanche  of  informa- 
tion, social  data  are  increasingly  used  to  build  balanced  relationships  between 
business  and  customers.  This  way,  consumers  are  stimulated  by  successful  web 
actors  to  share  their  data  truthfully  with  the  prize  of  earning  the  power  of  being 
listened  to,  in  addition  to  other  concrete  advantages:  the  online  world  is  begin- 
ning to  be  ruled  by  users’  expectations  (Weigend  2009)  with  major  implication 
for  industries  and  businesses.  Likewise,  social  media  are  growingly  becoming 
a major  arena  for  politics.  We  live  in  a time  where  the  premiere  of  a candi- 
dature to  the  United  States  of  America  presidential  elections  is  expected  to 
be  released  on  Twitter  and  YouTube  channels,  and  where  social  movements 
involving  thousands  of  people  are  organized  on  the  internet.  More  and  more 
public  authorities  are  moving  online  along  the  stream  of  a general  digital  social 
uptake.  Nevertheless,  unlike  mainstream  politics  or  other  sectors  of  govern- 
ment, in  the  domain  of  spatial  planning  and  governance  methods  and  tools 
to  fully  exploit  the  potential  of  social  data,  actively  (i.e.  to  create  discourse)  or 
passively  (i.e.  to  listen  to  the  community),  still  are  not  widely  used.  This  poses 
questions  concerning  the  actual  willingness  of  citizens  to  establish  such  power- 
ful user  relationship  with  public  authorities,  and  what  public  authorities  can  do 
to  meet  citizens’  expectations. 

From  a technical  perspective,  nowadays  many  organizations  rely  on  social 
media  management  tools  to  interact  with  their  customers,  which  in  the  case  of 
public  authorities  include  the  citizens;  however  these  social  engagements  tools 
often  remain  siloed  from  other  enterprise  applications  (Oracle  2013).  As  in 
many  other  domains  in  the  private  and  the  public  sector,  spatial  planning  and 
governance  is  a sub -domain  where  the  integration  of  social  data  with  authorita- 
tive official  information  may  provide  opportunities  for  improving  the  dialogue 
with  the  citizens,  by  not  only  listening  to  their  preferences  but  also  by  monitor- 
ing the  social  processes  they  are  involved  in,  towards  more  pluralist,  informed, 
and  community- oriented  decision  making. 

Unstructured  social  media  contents  that  capture  citizens’  interests,  inten- 
tions, perceptions,  and  needs  may  enrich  traditional  institutional  and  other 
commercial  data  sources.  Key  Performance  Indicators  can  be  developed  to 
monitor  through  real  time  dashboards  the  reactions  of  the  community  to  the 
public  policies  and  actions  helping  to  respond  promptly  to  citizens’  behaviors, 
moving  a step  towards  a new  generation  of  planning  intelligence,  thereby  con- 
tributing to  a more  sustainable  and  smart  growth. 

With  this  premise  the  chapter  is  articulated  as  follows.  In  the  next  section  a 
brief  overview  of  recent  advances  in  digital  spatial  data  sources  argue  knowl- 
edge building  in  planning  is  enriched  by  the  availability  of  social  data.  After- 
wards, a definition  of  Social  Media  Geographic  Information  (SMGI)  is  given  as 
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a special  type  of  Volunteered  Geographic  Information  (VGI).  Then,  in  the  core 
section  of  the  chapter,  a novel  SMGI  analytics  is  proposed  from  the  perspective 
of  spatial  planning  and  governance.  The  conclusions  briefly  summarize  and 
propose  issues  for  further  research  development. 


Social  data  enrich  the  planning  intelligence 

Recent  approaches  to  spatial  planning  and  design  propose  the  concept  of 
Geodesign,  emphasizing  the  role  of  knowledge  about  the  local  territorial 
context  to  inform  design  and  decision-making.  According  to  Steinitz  (2012) 
there  is  no  such  profession  as  the  Geodesigner,  rather  a Geodesign  process  is 
carried  out  through  collaboration  among  different  experts  coming  from  the 
design  disciplines  and  from  the  Geographic  (Information)  Sciences,  as  well  as 
stakeholders  and  other  actors  from  the  local  communities,  or  the  people  of  the 
place.  Unlike  until  a decade  ago,  such  an  approach  is  currently  enabled  thanks 
to  development  both  in  authoritative  and  volunteered  sources  of  geographic 
information. 

On  the  one  hand,  in  an  increasing  number  of  countries  and  regions  develop- 
ments in  Spatial  Data  Infrastructures  (SDI)  are  starting  to  offer  planners  doz- 
ens, and  in  some  cases  even  hundreds,  of  official  large-scale  spatial  data  layers, 
enabling  the  transition  from  analogue  cartography  analysis  to  digital  geo- 
processing  in  the  representation  and  analysis  of  territorial  processes  as  well  as 
in  the  environmental  impact  assessment  of  design  alternatives.  This  is  the  com- 
mon case  in  Europe,  where  the  Directive  02/2007/EU  establishing  the  INfra- 
structure  for  SPatial  InfoRmation  in  Europe  (INSPIRE)  is  promoting  public 
access  to  official  spatial  data  produced  by  public  authorities  at  all  levels  in  the 
Member  States.  Along  the  adoption  and  the  implementation  of  INSPIRE,  in  a 
growing  number  of  regions,  Advanced  Regional  SDIs  are  offering  spatial  data 
and  services  to  professionals  working  on  spatial  planning  and  environmental 
impact  assessment  and  are  thus  starting  to  bring  innovation  into  the  planning 
and  design  practice  (Campagna  & Craglia  2012). 

On  the  other  hand,  the  wealth  of  Volunteered  Geographic  Information 
(Goodchild  2007)  offered  by  geobrowsers  and  widespread  diffusion  of  GPS- 
equipped  handheld  devices,  is  starting  to  represent  a novel  -but  already  must- 
have  - sources  of  information  in  many  fields  according  to  neo -geography  or 
citizens  science  approaches.  OpenStreetMap4  may  be  considered  one  of  the 
most  successful  example  of  GI  crowdsourcing  to  create  a comprehensive  and 
high  quality  open  spatial  dataset  as  a major  alternative  to  more  traditional 
official  or  commercial  sources.  Topography,  networks,  habitats,  biodiversity, 
diseases  spreading,  climate  change,  and  hazards  are  some  of  the  examples  of 
environmental  and  social  processes  being  mapped  by  voluntary  observers 


4 www.openstreetmap.org. 
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acting  as  citizens  sensors.  However,  only  a fraction  of  available  VGI  is  purpose- 
fully produced  and  contributed,  while  an  even  larger  share  is  made  available 
often  as  unaware  results  of  the  use  of  social  media  on  web  and  mobile  apps. 
With  regard  to  the  latter  share,  which  is  the  focus  of  this  paper,  the  next  section 
argues  that  social  media  georeferenced  content  deserves  to  be  treated  individu- 
ally in  research  for  its  peculiar  characteristics,  and  special  focus  is  given  to  its 
potential  as  complementary  knowledge  base  in  spatial  planning  and  govern- 
ance to  support  design  and  decision  making. 


Why  social  data  are  special  when  they  are  spatial? 

Social  Media  Geographic  Information  (SMGI)  can  be  defined  as  any  piece  or 
collection  of  multimedia  data  or  information  with  explicit  (i.e.  coordinates)  or 
implicit  (i.e.  place  names  or  toponyms)  geographic  reference  collected  through 
the  social  networking  web  or  mobile  applications.  Social  data  are  acknowl- 
edged as  a good  of  major  value  in  the  digital  economy,  and  their  potential  for 
enhancing  more  traditional  analytics  is  of  the  utmost  importance.  A big  part  of 
social  data  however  also  features  spatial  (and  temporal)  references,  thus  their 
integration  with  more  traditional  Authoritative  Geographic  Information  (AGI) 
may  enable  a further  step  towards  the  next  generation  of  geospatial  intelligence. 

SMGI  is  a sub-category  of  VGI  and  can  be  active  or  passive,  depending  on  the 
type  of  application  with  which  it  is  collected:  applications  purposefully  created 
and/or  used  to  collect  SMGI  in  participatory  initiatives  (as  in  Campagna  2014) 
originate  active  SMGI,  while  SMGI  harvested  by  general  purpose  social  media 
such  as  Twitter  or  Instagram  are  passive,  and  can  be  considered  more  generi- 
cally  as  user  generated  content. 

Multimedia  content  of  SMGI  may  include  texts,  images,  videos,  or  audios 
in  whatever  combination  usually  aggregated  in  place  marks  or  posts.  Together 
with  spatial  references,  place  marks  and  posts  usually  feature  a time  reference 
and  creator,  or  user  owner. 

Application  Programming  Interfaces  (API)  can  be  used  by  the  public  to  access 
SMGI.  Hence,  the  data  model  of  each  publicly  accessible  sources  depends  both 
on  the  original  data  model  in  the  social  media  owner  database  and  on  the  API. 
In  general,  the  publicly  available  SMGI  data  model  features  a subset  of  the  orig- 
inal attributes  or  multimedia  data,  implying  that  the  analytical  potential  is  in 
general  greater  within  the  social  media  companies  than  for  the  public  (Lazer  et 
al.  2009).  Through  the  APIs,  SMGI  may  be  retrieved  and  accessed  by  keyword, 
by  space,  by  time,  by  user,  or  by  a combination  of  the  former  depending  on  the 
original  social  media  platform  and/or  API.  Data  returned  by  a query  through 
the  API  can  be  converted  in  a spatio-temporal  dataset.  Hence,  spatial- temporal 
analyses  are  supported  on  SMGI  as  well  as  user-behavioral  analysis  (i.e.  the 
analysis  of  user’s  behavior  in  term  of  data  production  and  sharing).  Spatio- 
temporal  and  user-behavioral  analyses  may  be  combined  with  querying  and 
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mining  techniques  on  multimedia  (e.g.  spatio-temporal  textual  analysis  can  be 
defined  as  the  analysis  of  text  in  a given  area  in  a given  moment  or  time  frame) 
creating  new  opportunities  for  SMGI  analytics. 


Towards  SMGI  Analytics  for  spatial  planning  and  governance 

As  introduced  in  the  previous  sections,  it  is  argued  in  this  paper  that  SMGI  may 
turn  out  to  be  a very  valuable  source  of  information  to  support  spatial  planning 
and  design.  However,  a novel  analytics  is  to  be  formalized  for  the  peculiar  data 
models  which  make  this  type  of  information  different  from  more  traditional 
vector  spatial  datasets,  with  which  it  can  be  integrated  (i.e.  AGI;  e.g.  spatial  data 
layer  from  institutional  SDI)  in  order  to  elicit  knowledge  useful  for  informing 
spatial  planning,  design,  or  governance. 

At  the  current  state  of  development,  common  AGI  vector  datasets  available 
for  download  in  national  or  regional  SDIs  as  shapefiles  or  remotely  accessible 
as  Web  Feature  Services,  in  Italy  as  in  other  countries  in  Europe,  feature  a geo- 
graphic and  a thematic  component  or  dimension.  Hence,  they  can  be  repre- 
sented as  follows: 

AGI  = <x,  y,  z;  a>  where  x,  y,  z represent  the  geographic  coordinates,  and  a. 
any  thematic  alphanumeric  attributes  of  the  common  relational  database  data 
types  (i.e.  text,  incl.  URL,  numbers,  or  dates). 

In  contrast,  SMGI  usually  features  a richer  data  model  including  temporal 
and  multimedia  components.  Additionally,  each  piece  of  information  exhibits 
a user  dimension,  which  may  include  an  identifier  as  well  as  other  data  which 
convey  information  on  the  users  profile.  The  latter  plays  a special  semantic  role 
for  the  user  who  produced  and  shared  the  single  piece  of  information  becom- 
ing a dominant  dimension  from  the  analytical  perspective.  In  addition,  each 
piece  of  SMGI  often  features  a score  expressing  agreement  by  or  interest  for  and 
popularity  within  the  virtual  community,  due  to  the  functioning  of  the  majority 
of  the  social  networking  apps. 

Thus,  SMGI  can  rather  be  represented  as  follows: 

SMGI  = <x,  y,  z;  t;  u;  m.;  1>  where  t represents  time  associated  to  each  ele- 
ment of  the  set,  u the  user,  m.  the  multimedia  content  (i.e.  text,  images,  video, 
or  audio  clip),  and  1 the  amount  of  ‘likes  and  dislikes’,  the  number  of  ‘stars’,  or 
any  other  kind  of  popularity  or  agreement  score,  which  indicates  consensus  on 
the  measure  and  should  be  treated  accordingly  in  the  analysis.  Figure  1 sum- 
marizes the  main  features  of  the  AGI  and  SMGI  different  data  models. 

Therefore,  any  SMGI  analytical  framework  should  include  not  only  tradi- 
tional spatial  analysis  but  also  temporal,  multimedia,  and  user  behavioral 
analyses  methods,  and  these  should  be  tightly  integrated  in  order  to  fully 
exploit  the  knowledge  potential  embedded  in  data.  From  a planning  analysis 
perspective,  coupling  these  methods  in  an  integrated  GIS  application  would 
be  an  advantage  in  as  much  GIS  is  (becoming)  the  common  platform  for  the 
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Figure  1:  Scheme  of  the  AGI  (up)  and  of  the  SMGI  (down)  data  models. 


planning  profession  given  the  role  of  maps  in  expressing  knowledge  and  design 
in  this  domain.  With  this  consideration  in  mind,  a framework  for  SMGI  ana- 
lytics have  been  developed  by  the  author  with  the  objective  of  exploiting  this 
new  resource  of  information  in  order  to  enrich  the  knowledge  about  the  local 
context  from  the  social  perspective  and  to  better  support  spatial  planning  and 
governance.  This  framework  is  under  development;  nevertheless  the  current 
results  seem  promising  and  allow  proposing  a tentative  formalization  as  a guide 
for  further  research  in  this  field. 

In  the  remainder  of  the  paper,  this  tentative  framework  for  SMGI  analyt- 
ics is  presented  with  references  to  the  case  studies  developed  at  the  University 
of  Cagliari  under  the  supervision  of  the  author.  The  case  studies  relate  to  two 
major  research  streams.  The  first  one  pertains  the  development  and  use  of  a 
map-based  social  networking  platform,  namely  ‘Place,  I care!’  (PIC!  1.0).  The 
latter  can  be  defined  as  an  active  SMGI  resource  where  users  can  create  thanks 
to  a user-friendly  interface  a private  or  public  project.  Thanks  to  an  advanced 
user  permissions  manager  each  project  can  be  customized  to  the  contextual 
use  case  requirements.  This  way,  accepted  users  can  be  allowed  (or  not)  either 
to  post,  like/dislike,  or  comment,  enabling  different  levels  of  participation  in 
a map  based  discussion.  Simple  query  functions  are  available  in  the  map  inter- 
face and  collected  data  can  be  easily  exported  for  further  analysis  with  GIS 
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packages.  PIC!  1.0  was  used  in  a number  of  pilot  projects  (Campagna  2014; 
Campagna  et  al.  2015). 

The  second  research  stream  concerns  the  collection  and  use  of  SMGI  pro- 
duced by  major  social  networking  platform  such  as  Twitter,  YouTube,  Insta- 
gram,  and  Booking.com  (i.e.  passive  SMGI  resources).  In  both  cases,  SMGI 
datasets  were  collected  in  selected  areas  at  various  scales  (i.e.  from  the  global, 
to  the  regional,  to  the  local)  and  then  integrated  with  other  sources  of  AGI  (e.g. 
from  regional  or  local  SDI)  or  VGI  (e.g.  WikiMapia). 

When  SMGI  is  integrated  in  a GIS  project  with  other  authoritative  and  vol- 
unteered spatial  datasets,  the  peculiarities  of  their  data  models  enable  the  ana- 
lyst to  perform  the  following  analysis  on  SMGI  data: 

• Spatial  analysis  of  user  interests:  thanks  to  the  widespread  use  of  social 
media,  the  high  number  of  georeferenced  posts  enables  us  to  investigate  the 
patterns  of  user  interest  in  space  by  density  (Campagna  2014)  and  clustering 
functions  (Massa  8c  Campagna  2015).  The  overlay  with  topographic  AGI  such 
as  administrative  boundaries,  or  physical  artefacts  such  as  buildings,  infra- 
structure, services  or  public  spaces,  may  offer  useful  hints  to  public  authori- 
ties to  understand  not  only  which  places  are  important  to  the  community  and 
how  they  are  perceived  (Campagna  2014),  and  by  whom  the  community  is 
eventually  composed  (e.g.  local  people,  commuters,  tourist  or  other); 

• Temporal  analysis  of  user  interests:  the  temporal  reference  is  often  an 
available  attribute  in  SMGI,  enabling  to  study  when  given  regional  destina- 
tions, urban  districts,  public  spaces,  or  other  infrastructures  and  services 
are  used  along  the  year,  the  months,  the  week,  or  the  day  (an  example  of  this 
type  analysis  is  given  in  Massa  and  Campagna,  in  this  volume); 

• Spatial  Statistics  of  user  preferences:  collecting  posts  by  spatial  units 
enables  planners  to  analyze  patterns  in  user  interest  at  different  scales. 
An  example  is  given  in  Floris  and  Campagna  (2014),  where  the  hot-spot 
analysis  has  been  used  at  the  regional  level  to  study  the  distribution  by 
municipality  of  positive  user  assessments  by  user  profiles,  where  the  hot- 
spot analysis  has  been  used  at  the  regional  level  to  study  the  distribution  of 
positive  user  assessments  by  user  profiles  to  discover  where  young  vs.  elder, 
or  family  vs.  solo  tourists  prefer  to  go  during  their  trips.  In  the  same  study, 
the  hot-spots  were  then  investigated  with  Spatio-Temporal  Textual  analysis 
(see  STTx  below)  and  with  geographically  weighted  regression  analysis  to 
explore  at  the  local  level  what  physical  and  locational  factors  may  affect  the 
preferences; 

• Multimedia  content  analysis  on  texts,  images,  video,  or  audio:  multime- 
dia analysis  is  well  developed  in  the  case  of  texts  analytics.  However,  it  is 
currently  more  difficult  to  automatically  extract  useful  information  from 
images,  video,  or  audio.  In  the  case  of  text,  many  software  packages  can 
be  used  to  apply  simple  (i.e.  calculating  words  frequency,  or  tag  clouds) 
to  more  advanced  (e.g.  sentiment  analysis)  text  analysis  techniques.  These 
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techniques  can  be  easily  applied  to  subsets  of  SMGI  obtained  by  spatial, 
temporal,  or  user  query  (see  STTx  below); 

• User  behavioral  analysis:  querying  SMGI  by  user  enables  to  study  users’ 
behavior  in  space  and  time.  This  information  can  be  used  to  analyze,  for 
example,  whether  a public  space  is  visited  by  local  people  or  by  visitors.  This 
information  may  be  useful  also  for  profiling:  for  the  users  visiting  a certain 
place  or  service  user  Spatio-temporal  footprints  can  be  defined  to  identify 
people  who  mainly  move  locally,  regionally,  or  internationally,  and  where 
they  come  from; 

or 

• A combination  of  two  or  more  of  the  previous  such  as  in  Spatial-Temporal 
Textual  Analysis  (STTx):  textual  analysis  functions  can  be  integrated 
within  GIS  applications  (Campagna  2014),  enabling  the  application  of  text 
analysis  techniques  to  subset  of  SMGI  selected  by  space  and  time.  Several 
examples  were  tested  by  the  authors  to  analysis  the  perception  of  different 
neighborhoods  in  the  city  using  Place,  I care!  data  (Campagna  2014),  the 
judgment  of  tourists  on  a destination  using  Booking.com  data  (Floris  & 
Campagna,  2014),  or  who  is  talking  about  an  event  of  global  reach  around 
the  world  using  Twitter  and  YouTube  data  (Massa  & Campagna  2014). 
Although  not  verified  yet  by  a systematic  analysis,  several  case  studies  on 
the  application  of  STTx  to  the  same  areas  with  different  SMGI  sources,  thus 
different  type  of  users,  returned  similar  results,  suggesting  further  research 
should  be  devoted  to  better  understand  the  issue  of  representativeness. 


Conclusions 

The  tentative  framework  presented  in  this  paper  derives  from  testing  single 
SMGI  sources  integrated  with  AGI  in  a GIS  environment.  The  research  may 
be  considered  still  in  its  infancy  and  the  SMGI  analytics  framework  proposed 
here  is  likely  to  evolve  substantially  in  the  future,  nevertheless  the  potential 
seems  already  be  very  promising  and  it  offers  many  issues  to  be  further  inves- 
tigated. Early  experiments  seem  already  to  demonstrate  how  it  is  possible 
to  introduce  new  dimensions  of  analysis  in  order  to  build  useful  knowledge 
for  design  and  decision  making  in  spatial  planning  and  governance.  Further 
research  is  under  development  in  order  to  explore  the  potential  of  integrating 
multiple  SMGI  data  sources  together.  Indeed,  each  SMGI  source  has  different 
peculiarities  given  by  specific  data  models  and  public  accessibility  feature.  In 
addition,  each  SMGI  source  has  different  rate  of  diffusion  and  usage  in  different 
regions  making  their  combined  use  unique  to  a given  place,  making  the  case  for 
local  SMGI -AGI  mixes.  This  issue  is  further  amplified  by  the  fact  that  also  AGI 
sources  may  vary  differently  in  different  regions  and  countries.  Nonetheless, 
the  possibility  of  introducing  pluralist  knowledge  on  peoples  perceptions,  pref- 
erences, or  needs  is  an  opportunity  of  utmost  importance  to  bring  innovation 
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towards  more  sustainable,  democratic,  and  community- oriented  design  and 
decision-making  in  spatial  planning  and  governance. 

The  case  studies  referenced  have  already  demonstrated  how  SMGI  data 
enable  us  to  observe  to  the  community  and  to  establish  a dialogue.  From  the 
planning  perspective,  the  possibility  to  tightly  couple  participatory  initiatives 
to  traditional  planning  knowledge  in  an  integrate  environment  is  a promising 
frontier  to  be  further  investigated.  Territorial  marketing,  urban,  and  regional 
as  well  as  sector  planning  and  spatial  governance  in  general  have  now  a new 
power  to  inquiry  the  limit  of  which  seems  to  be  only  the  ability  to  look  at  data 
from  a new  perspective  and  ask  smart  questions. 
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Abstract 

While  spatial  information  quality  is  an  established  discipline  in  traditional  sci- 
entific geographical  information  (GI),  standards  and  protocols  for  representing 
and  assessing  the  quality  of  geographic  contributions  generated  by  volunteers 
or  by  the  generic  web  crowd’  are  still  missing.  This  work  offers  an  analysis 
of  strategies  for  quality  control  and  describes  a simple  representation  of  the 
components  of  the  quality  in  crowdsourced  GI.  In  this  framework,  and  based 
on  the  research  carried  out  in  Criscuolo  et  al.  (2014),  we  also  introduce  a meth- 
odology for  quality  assessment,  based  on  the  given  representation,  which  goes 
beyond  the  limitations  of  previous  methods  in  the  literature  defined  for  a spe- 
cific purpose,  being  able  to  deal  with  many  quality  features,  GI  categories,  and 
types  of  application.  The  method  is  designed  as  a decision  making  approach,  so 
flexible  as  to  take  into  account  the  purpose  of  GI  analysis,  and  so  transparent 
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as  to  make  explicit  the  criteria  driving  to  quality  evaluation,  namely  the  qual- 
ity features  (e.g.  the  credibility  of  the  volunteers,  or  the  accuracy  of  the  spatial 
features,  etc.)  and  their  relevance. 


Keywords 
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Main  issues  in  utilizing  crowdsourced  GI  for 
scientific  purposes 

With  the  development  of  the  geo-web  and  the  increasing  popularity  of  mobile 
devices  and  communication  technologies,  in  the  last  decade  many  geographic 
information  consumers  have  extended  their  role  to  the  most  active  one  of  geo- 
graphic content  producers.  The  geographic  information  generated  so  far,  is 
characterized  by  great  heterogeneity  - both  in  semantics,  formats,  contents, 
and  quality. 

In  fact,  crowdsourced  GI  is  most  frequently  provided  on  the  web  - by  both 
aware  contributors  within  scientific  initiatives  (VGI)  and  unaware  contribu- 
tors within  social  networks  - in  the  form  of  text  commenting  events,  advice, 
warnings  related  with  physical  locations,  geo-tagged  photographs,  and  points 
of  interest  (POIs),  corresponding  for  instance  to  historic,  cultural,  and  natural- 
istic destinations  valuable  for  touristic  or  commercial  purposes.  Contributors 
frequently  provide  also  a geometric  georeferenced  representation  of  the  foot- 
prints of  the  POIs  in  the  form  of  points,  polylines,  or  polygons  (e.g.  centroid 
of  a building,  a polyline  for  a road  or  a trail,  a polygon  for  a park  boundary). 
This  geometric  information  can  be  acquired  by  GPS,  or  by  sensors,  or  special 
equipment. 

These  contributions  arise  the  interest  of  the  scientific  community,  historically 
engaged  in  the  creation  and  distribution  of  geographic  information,  together 
with  concerns  on  the  consequences  that  such  new  practices  can  lead  to  estab- 
lished scientific  disciplines. 

In  fact,  there  are  many  problems  related  to  the  creation  and  use  of  geographic 
information  coming  from  non-traditional  sources  for  scientific  purposes. 

First  of  all,  for  a (spatial)  dataset  to  be  reusable,  it  should  be  coupled  with  its 
metadata,  which  define  the  domain  within  which  its  usage  is  recommended 
(temporal  and  geographic  references,  spatial  resolution,  quality  and  valid- 
ity, constraints,  etc.).  Crowdsourced  geographic  information  is  often  lacking, 
in  whole  or  in  part,  met  a- information  allowing  us  both  to  locate  it  precisely 
in  space  and  time,  and  to  evaluate  the  basic  parameters  for  its  usage,  such  as 
acquisition  procedure,  measurement  accuracy,  instrumental  precision,  time 
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stamps,  contact  details,  etc.  (Sui,  Elwood  & Goodchild,  2013).  In  some  cases, 
some  elements  are  available  to  enrich  the  meta- information  of  the  crowd- 
sourced contribution,  but  they  are  expressed  in  unusual  forms  (for  instance 
authors  nicknames,  tags  and  geotags,  external  links,  attached  Exif  files,  etc.). 

This  issue  especially  emerges  when  datasets  of  user  generated  content  created 
by  their  authors  for  non- scientific  purposes  (social,  promotional,  documental, 
etc.)  are  retrieved,  selected,  and  exploited  in  the  framework  of  scientific  pro- 
jects, for  public  or  governmental  decision  making  purposes. 

The  second  critical  point  arises  when  processing  crowdsourced  geographic 
information.  In  fact,  while  gathering  large  volumes  of  user  generated  geo- 
graphic contributions  is  relatively  easy  (typically  15%  of  social  media  contents 
are  georeferenced),  to  spatially  overlay  and  thematically  integrate  this  informa- 
tion could  be  extremely  difficult.  This  is  due  to  the  different  - or  commonly 
undefined  - instrumental  precision,  reference  systems,  spatial  and  temporal 
granularity,  together  with  the  absence  of  common  attributes  and  conceptual 
schemas,  which  often  make  the  spatial  analysis  of  user-generated  georefer- 
enced data  a burden. 

A third  issue  is  related  to  the  trustworthiness  of  contributed  data.  The  qual- 
ity of  a crowdsourced  contribution  indeed  is  not  just  a characteristic  of  the 
data:  it  is  also  related  to  the  authors  reliability  and  experience,  i.e.  knowl- 
edge of  the  domain  and  ability  in  using  the  tools  for  data  creation.  By  taking 
into  account  these  aspects,  it  is  possible  to  state  the  trustworthiness  of  the 
information. 

The  concept  of  trustworthiness  suits  both  the  conventional  production  of 
expert  scientific  information,  and  the  crowdsourced  contents,  even  if  the  lat- 
ter is  more  complex,  due  to  several  reasons,  among  which  the  difficult  trace - 
ability  of  authors,  their  unknown  reputation,  and  the  lack  of  standards  and 
merit  systems.  In  the  last  decade  several  studies  have  been  focused  on  building 
credibility  models  (Metzger  2007;  Ke(31er  et  al.  2013),  analyzing  quantitatively 
and  qualitatively  user  generated  content  fluxes  on  the  web  by  discussing  their 
intrinsic  characteristics,  sources,  subjects,  drives  (Eysenbach  & Kohler  2002; 
Coleman  et  al.  2009;  Van  Dijck  2009),  and  currently  the  issue  is  still  open  and 
debated. 

Because  of  the  absence  of  a systematic  procedure  for  amateurs  data  produc- 
tion, it  is  commonly  acknowledged  that  official  data  have  a greater  reliability 
and  usefulness  to  science,  while  volunteered  and  non- specialist  data  are  more 
affected  by  inaccuracies  and  contain  less  scientific  value.  Some  authors  have 
spent  efforts  to  prove  - or  contradict  - such  a hypothesis  by  comparing  datasets 
of  crowdsourced  and  specialized  observations.  Dickinson  et  al.  (2010)  reports 
a series  of  studies  in  which  variations  in  observer  quality  are  correlated  to  the 
authors  preparation.  Among  factors  influencing  such  variations  are  back- 
ground and  experience  (Galloway  et  al.  2006)  together  with  the  type  of  task 
(De  Solla  et  al.  2005;  Genet  & Sargent  2003;  Lotz  & Allen  2007),  the  level  of 
training,  the  company  of  a specialist  in  the  field  (Fitzpatrick  et  al.  2009),  and 
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the  age  and  education  of  the  author  (Delaney  et  al.  2007).  On  the  other  hand, 
several  studies  have  shown  that  the  creative,  aggregate  use  of  non-expert  con- 
tributions can  generate  new  valuable  information  (De  Longueville  et  al.  2010; 
Antoniou  et  al.  2010;  Friedland  & Choi  2011),  and  have  documented  situations 
in  which  local  knowledge  or  expertise  provide  information  of  greater  value 
than  the  expert  knowledge  alone  (Fisher  2000).  There  is  evidence  of  the  high 
potential  of  crowdsourced  geographic  information  when  collected  and  man- 
aged in  well- structured  contexts,  also  in  the  results  of  the  analysis  conducted  by 
authors  such  as  Haklay  (2010),  Girres  and  Touya  (2010),  Ciepluch  et  al.  (2010), 
who  have  evaluated  the  accuracy  of  OpenStreetMap  data  against  reference 
sources,  and  found  that  sometimes  crowdsourced  data  have  a better  accuracy 
than  the  reference  datasets. 

Several  other  sensitive  topics  can  be  identified,  related  to  the  scientific  usage 
of  crowdsourced  GI.  Since  an  adequate  treatment  of  these  problems  would  be 
beyond  the  scope  the  paper,  we  just  mention  here  the  complex  issue  of  the 
reproducibility  in  the  procedures,  the  one  of  personal  data  protection,  and 
those  related  to  the  distribution  policy  (establishment  of  intellectual  property 
rights,  copyrights,  and  related  rights). 

In  order  to  make  it  possible  a controlled  use  of  crowdsourced  contributions 
depending  on  the  purpose  of  the  reuse,  the  authors  propose  to  establish  a theo- 
retical framework  for  a flexible  and  transparent  quality  representation  and  han- 
dling (further  analysis  on  this  can  be  found  in  Criscuolo  et  al.  2014,  Bordogna 
et  al.  2014a  and  in  Bordogna  et  al.  2014b):  flexibility  is  intended  to  offer  the 
possibility  to  customize  the  criteria  of  the  quality  assessment  to  different  pur- 
poses and  needs;  transparency  is  intended  to  offer  the  possibility  for  a user  to 
know  the  criteria  used  for  selecting  the  crowdsourced  information. 

In  the  next  sections  the  topic  of  quality  management  for  generic  GI 
is  addressed,  firstly  by  describing  a comprehensive  model  to  represent  GI 
quality,  then  by  discussing  the  approaches  for  its  control,  finally  by  introducing 
a methodology  for  its  assessment,  suitable  for  both  traditional  and  crowd- 
sourced GI. 


Representing  quality  in  crowdsourced  GI 

In  this  work  the  types  of  multimedia  geographic  information  are  grouped  in 
the  following  categories: 

• images:  photographs,  video  recordings  and  graphic  objects; 

• annotations:  mostly  textual  reports; 

• features:  spatial  entities,  mono-  or  multi- dimensional,  with  associated 
attributes  (such  as  Shapefiles  or  Geography  Markup  Language  files); 

• measurements:  values  derived  from  human  or  sensors  observations, 
mainly  as  numbers. 
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The  contributions  expressed  in  form  of  rating,  i.e.  the  public  evaluation  of 
user-contributed  geographic  contents  (e.g.  thumbs  up  / down,  star  ratings. . .), 
are  deliberately  excluded;  in  fact,  these  expressions  are  certainly  informa- 
tive, and  are  widely  used  too,  but  are  more  similar  to  quality  assessment  tools 
than  to  stand  alone  geographic  information.  For  this  reason  in  the  present 
work  the  contributions  in  the  form  of  ratings  will  be  discussed  as  mecha- 
nisms for  quality  control,  i.e.  a kind  of  quality  indicators,  and  not  as  a type  of 
crowdsourced  GI. 

Each  GI  item,  i.e.  each  informative  contribution,  can  consist  of  a single  piece 
or  be  composed  of  multiple  elements.  In  case  of  multiple  elements,  they  may 
belong  to  the  same  category  (for  instance,  they  can  report  measurements  of 
various  physical  parameters  from  a single  measuring  station)  or  be  of  different 
categories  (for  instance  a photograph  and  its  textual  description). 

Once  the  category  and  structure  of  a GI  have  been  described,  it  is  necessary 
to  represent  its  quality. 

The  discussion  on  quality  in  GI  has  a long  history,  which  starts  from  the 
last  century,  deepens  with  the  advent  of  GIS  technology  (for  a comprehensive 
review  refer  to  Van  Oort  2006),  and  finds  a new  flourishing  in  the  last  dec- 
ade, with  the  advent  of  geo-web  and  the  proliferation  of  collaborative  mapping 
applications.  In  fact,  although  the  quality  of  GI  has  been  widely  discussed  and 
has  its  reference  standard  in  ISO  19157:2013  (ISO/TC  211/2010  - Geographic 
information/Geomatics),  the  quality  of  crowdsourced  GI  presents  some  differ- 
ent features,  such  as  to  require  new  indicators  to  be  adequately  described  and 
evaluated  (Van  Exel  et  al.  2010). 

The  quality  of  crowdsourced  GI  is  actually  a composite  property:  it  includes 
not  only  some  aspects  dealing  with  the  characteristics  of  the  data,  but  also 
aspects  dealing  with  the  characteristics  of  the  data  producer  and  with  the  appli- 
cation context. 

In  ISO  19113-15  two  main  categories  of  quality  are  taken  into  account:  Inter- 
nal and  External.  The  first  one  relates  to  intrinsic  characteristics  of  information 
(spatial  accuracy,  temporal  accuracy,  semantic  accuracy...),  while  the  second 
one  deals  with  the  fitness  for  use  of  the  information.  These  categories  are  cer- 
tainly necessary  to  perform  quality  assessment  on  single  pieces  of  information 
or  on  whole  datasets,  but  they  don’t  cover  a third  aspect  of  user  generated  infor- 
mation quality,  which  is  important  especially  for  crowdsourced  resources:  the 
trustworthiness  of  information. 

To  take  into  account  this  complexity,  we  choose  to  describe  the  quality  of 
GI  through  three  main  categories,  inspired  by  the  ISO  19113-15  and  by  the 
thematic  literature: 

• intrinsic  quality,  corresponding  to  ISO  internal  quality,  which  depends  on 
the  characteristics  of  the  informative  content; 

• extrinsic  quality,  which  depends  on  the  characteristics  of  the  context,  and 
responds  to  the  needs  of  assessing  the  credibility  both  on  the  information 
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and  on  the  author  (Flanagin  & Metzger  2008;  Galloway  et  al.  2006;  Genet  & 
Sargent  2003); 

• pragmatic  quality,  first  described  by  English  (1999)  and  similar  to  the  ISO 
external  quality,  which  measures  the  capability  to  meet  the  needs  of  a user 
or  of  a usage. 

The  features  that  contribute  to  determine  the  intrinsic,  extrinsic,  and  pragmatic 
quality  of  a piece  of  geographic  information  can  be  broken  down  into  elemen- 
tary properties.  These  properties  are  many  and  varied,  and  can  be  updated 
under  different  project  conditions.  We  list  only  a few  of  them,  selected  from  the 
most  important  and  most  frequent  in  the  specialist  literature. 

Intrinsic  quality  can  be  described,  for  instance,  by  the  following  elementary 
properties: 

• accuracy,  i.e.  its  conformity  to  the  actual  or  expected  value; 

• precision,  i.e.  the  repeatability  of  the  observation  or  of  the  measurement; 

• correctness,  i.e.  the  absence  of  formal  errors; 

• completeness,  i.e.  the  absence  of  significant  omissions; 

• intelligibility,  i.e.  the  possibility  of  the  contribution  to  be  understood  and 
examined. 

The  elementary  properties  relatable  to  the  extrinsic  quality  can  be: 

• reliability  of  the  information; 

• credibility  of  the  author. 

Finally,  the  pragmatic  quality  can  be  described  by  the  two  following  elementary 
properties: 

• pertinence  of  the  information; 

• fitness  for  a particular  use. 

While  the  elementary  properties  contributing  to  define  the  intrinsic  and  extrin- 
sic quality  can  be  defined  by  evaluating  elementary  quality  indicators  associ- 
ated to  specific  pieces  of  information  constituting  the  VGI  items,  the  last  two 
properties,  pertinence  and  fitness  for  use,  may  both  be  defined  in  terms  of  the 
extrinsic  and  intrinsic  quality  (Bordogna  et  al.  1914b). 

The  representation  of  quality  sketched  in  Figure  1 is  independent  from  the 
categories  of  contributions  (images,  annotations,  measurements,  features), 
from  the  information  content  and  context,  and  can  therefore  be  taken  as  a gen- 
eral framework  to  evaluate  - and  possibly  compare  - the  quality  in  any  crowd- 
sourced or  generic  GI  project. 
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Figure  1:  A sketched  representation  of  the  proposed  categories  and  the  main 
elementary  properties  of  crowdsourced  GI  quality 

Approaches  to  quality  control 

Once  defined  the  categories  of  crowdsourced  GI  and  reported  its  quality  prop- 
erties, we  can  describe  the  types  of  approach  to  quality  control. 

In  temporal  terms  the  approach  to  quality  control  may  take  place  via: 

prevention,  if  it  takes  place  through  procedures  that  precede  or  are  contextual 
to  the  submission  of  information  (e.g.  learning  materials,  controlled  vocabu- 
laries, web  forms  that  guide  data  producers  in  compiling  their  contributions); 
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correction,  if  it  occurs  after  the  contributions  are  submitted  to  the  system 
(e.g.  selection  of  contributions,  automatic  or  manual  corrections). 

From  the  point  of  view  of  the  actors  involved,  the  operations  for  a quality  con- 
trol can  be  carried  out: 

by  the  administration  team,  if  they  are  performed  manually  by  the  project 
coordinators,  a technical  staff  or  a group  of  experts; 
by  the  community  of  participants,  if  the  group  of  volunteers  itself  assesses 
and  validates  the  information  entered; 
automatically,  if  one  or  more  IT  components  of  the  system  operate  the  con- 
tent selection  or  make  some  automated  edits. 

Finally,  from  the  point  of  view  of  the  remedial  action  performed,  the  contribu- 
tions considered  unsuitable  may  be  subject  to: 

warning,  and  then  be  published  with  an  appended  message,  an  alerting  sym- 
bolism, or  a notice; 

removal,  being  excluded  from  publication  and  successive  processing. 

The  described  approaches  to  quality  control  in  crowdsourced  GI  are  repre- 
sented in  Figure  2. 


Figure  2:  A representation  of  the  approaches  to  quality  control  in  crowdsourced  GI. 
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For  each  use  and  each  context  of  application  the  best  fitting  strategy  must  be 
carefully  designed. 

Here  follows  the  discussion  of  the  advantages  and  disadvantages  associated 
with  each  option,  presenting  some  possible  implementations. 

The  preventive  approach,  aimed  to  facilitate  the  correct  compilation  of  VGI 
before  its  submission,  may  consist,  for  example,  in  simple  manuals  and  hand- 
books; in  assisted  completion  procedures,  with  multiple  choice  fields,  word- 
lists,  or  auto -completion;  in  automatic  tools  for  the  normalization  of  contents 
or  for  the  automatic  extraction  of  metadata;  in  ontologies  and  geographic  gaz- 
etteers (Popescu  et  al.  2009;  Kuhn  2001).  Common  preventive  actions  are  also 
the  selection  and  the  training  of  volunteer  contributors  (Galloway  et  al.  2006; 
Crall  et  al.  2011).  All  these  methods  facilitate  a uniform  and  formally  correct 
data  entry  from  contributors  (intrinsic  quality),  but  they  do  not  ensure  to  con- 
trol the  reliability  (extrinsic  quality)  or  the  fitness  for  use  of  the  information 
entered  (pragmatic  quality). 

The  corrective  methods  act  instead  ex  post,  by  amending  or  removing  the 
weak  VGI  contributions.  They  may  include  the  use  of  automatic  algorithms 
or  geostatistical  filters  (De  Tre  et  al.  2010;  Latonero  & Shklovski  2010),  but 
also  can  apply  a human  supervision  to  identify  systematic  errors  and  maintain 
the  consistency  of  the  dataset  (Dickinson  2010;  Huang  et  al.  2010),  or  still  can 
monitor  in  real  time  the  semantic  integrity  of  the  collected  data  (Pundt  2002). 
The  corrective  methods  are  suited  to  act  on  the  reliability  and  effectiveness  of 
the  information  provided  (extrinsic  and  pragmatic  quality),  as  well  as  on  the 
intrinsic  quality  characteristics,  but  since  they  act  for  removing,  merging,  or 
reshaping  inappropriate  contributions,  they  may  cause  partial  or  even  total  loss 
of  information. 

A quality  assessment  performed  by  a team  of  experts,  or  supervisors,  offers 
some  guarantees,  assuming  they  use  competence,  wisdom  and  fairness  in  the 
task.  Yet  even  scientists  or  professionals  cannot  always  enjoy  a full  mastery 
of  all  the  variables,  and  their  judgment  can  be  subjective  and  approximate.  It 
may  happen  that  local  citizens,  or  specialists  in  particular  activities,  or  direct 
observers  of  phenomena  make  more  detailed  and  reliable  assessments  than 
their  scientific  supervisors.  On  some  occasions,  however,  it  is  hazardous  to 
assign  assessment  tasks  to  volunteers.  In  fact,  for  lack  of  expertise,  superficial- 
ity or  bad  faith,  they  could  create  confusion  and  even  hamper  the  entire  data 
collection. 

The  combination  of  the  two  methods  - the  traditional  authoritative  (or  top- 
down),  and  the  democratic  one  (or  bottom-up),  is  not  only  possible,  but  can 
also  produce  significant  results.  In  this  context,  in  fact,  the  web  can  be  used  as 
a meeting  point  for  a collective  assessment:  the  ongoing  access  to  a web  item 
by  a hybrid  team  of  experts,  local  amateurs,  occasional  visitors,  who  are  asked 
for  evaluating  the  content,  can  give  rise  to  a kind  of  participative  evaluation  of 
quality,  with  a high  potential  for  selection  and  judgment  (Flanagin  & Metzger 
2008;  Connors  et  al.  2012). 
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Finally,  the  automatic  control  mechanisms  can  be  extremely  useful,  especially 
when  crowdsourced  contributions  form  large  volumes  of  data  (Spinsanti  & 
Ostermann  2013).  At  present,  however,  automatic  control  systems  rarely  reach 
an  adequate  degree  of  reliability,  comparable  to  a human  validation.  They  are 
likely  to  fail  especially  as  regards  to  the  relevance  of  the  contribution,  the  per- 
tinence (pragmatic  quality),  and  the  intelligibility/correctness  of  the  textual 
content  (intrinsic  quality). 

The  control  mechanisms  that  act  for  removing  flawed  contents  are  of  great 
help  to  preserve  the  integrity  and  the  consistency  of  the  data  collections.  None- 
theless, since  they  discard  information  considered  inadequate  according  to 
pre-set  parameters,  they  lead  to  the  exclusion  or  to  the  partial  loss  of  informa- 
tion that,  no  matter  how  flawed,  might  be  useful  in  other  contexts.  The  control 
mechanisms,  which  keep  the  whole  submitted  information,  even  if  not  compli- 
ant, but  report  the  flaws,  do  not  lose  any  entered  information  and  encourage 
the  users  to  access  the  data  consciously.  These  warning  mechanisms,  however, 
have  two  adverse  consequences:  on  the  one  hand  the  storage  process  is  non- 
effective,  because  it  allocates  some  memory  to  data  of  doubtful  relevance;  on 
the  other  hand  the  usage  is  made  more  difficult,  because  the  system  lets  the  user 
decide  on  the  data  reliability. 

Each  one  of  the  described  options  should  be  considered  and  evaluated  care- 
fully by  the  project  coordinators;  nevertheless,  most  of  the  times  hybrid  meth- 
ods can  help  in  achieving  a proper  management  of  quality,  by  balancing  the 
pros  and  cons  of  the  various  strategies. 


Suggesting  a quality  estimation  method  for  crowdsourced  GI 

Whatever  the  strategies  to  address  the  control  of  quality  in  crowdsourced  GI,  sub- 
sequently it  is  useful  to  define  a method  to  estimate  the  results.  This  estimation  is 
important  not  only  to  establish  the  level  of  quality  reached  by  the  single  contribu- 
tions and  by  the  whole  dataset,  but  also  to  monitor  the  quality  trends  over  time. 

In  recent  years  several  efforts  have  been  made  to  develop  procedures  for  qual- 
ity assessment.  Some  of  them  focus  on  the  credibility  issues  (Metzger  2007), 
some  others  focus  on  the  geographical  accuracy  (Ke(31er  et  al.  2013;  Sabone 
2009;  Goodchild  2008),  which  is  often  calculated  by  comparing  different  data- 
sets or  by  validating  a data  sample  with  a field  survey  (Haklay  2010).  These 
proposals,  while  effective  in  assessing  particular  aspects  of  quality,  are  useful  in 
their  specific  context,  but  do  not  offer  a general  or  flexible  method,  nor  include 
the  different  aspects  that  characterize  the  quality  of  non-traditional  GI. 

To  overcome  these  limitations,  we  base  on  the  representation  of  quality  intro- 
duced in  section  2 which  makes  it  possible  to  define  some  elementary  quality 
indices,  to  be  associated  with  each  component  of  the  GI  items,  and  then  to 
aggregate  the  elementary  indices  into  composite  ones,  until  reaching  an  overall 
index  of  quality  for  the  information  item,  and,  in  case,  the  quality  index  for  a 
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whole  dataset.  A similar  method  has  been  described  in  Bordogna  et  al.  (2014a), 
and  is  proposed  here  in  a simplified  form,  so  as  to  make  it  easily  applicable  and 
customizable  to  any  provided  GI,  i.e.  traditional,  non-expert,  volunteered  or 
even  unaware. 

First,  we  decompose  a generic  GI  item  into  its  elementary  components,  which 
may  consist  of  one  or  more  images,  annotations,  measurements,  geographic 
features.  The  overall  information  of  a contribution,  which  we  name  GITOT,  is 
therefore  achieved  by  aggregating  the  n informative  elements  GI.,  i=l,. . . n. 

GItot  = ® (GIj,  GI2,  GI3,...,  GIn)  ® being  a mathematical  aggregation  operator 

An  overall  quality  index  QTOT  is  then  associated  to  the  overall  information 
GItot.  Qtot  results  from  the  aggregation  of  the  n Q.  indices  associated  with 
the  n components.  In  this  aggregation  step,  each  index  Q.  is  associated  with 
a numerical  weight  K , which  is  properly  chosen  by  the  analyst  depending  on 
specific  design  requirements.  Also  the  aggregation  is  chosen  by  the  analyst,  for 
example  it  may  be  a weighted  average  or  a sum. 

QToT  = ©(K,Q1,K2Q2lK3Q3.-.KQn) 

Each  Q is  in  its  turn  the  result  of  the  aggregation  of  three  quality  indices  - I,  E., 
P - respectively  connected  to  the  intrinsic,  extrinsic,  and  pragmatic  properties 
of  the  GI  quality. 

I E.,  and  P , can  be  in  their  turn  associated  with  three  weights  - K , KE,  and 
Kp  - that  are  also  set  by  the  analyst,  depending  on  the  relevance  stated  for  each 
property  of  GI  quality. 

The  overall  quality  for  a GI  item  results  in: 


Qtot  = © (Kx*®  (Kjlj,  KpEj,  KpPj),  K/0  (KjI2,  KeE2,  KpP2), ...,  K *®  (Kjln,  K£En, 
KpPJ) 


I E.,  and  P can  be  finally  decomposed  in  lower  level  indices,  related  to  the  ele- 
mentary properties  of  GI  quality:  accuracy,  precision,  correctness,  complete- 
ness, intelligibility,  reliability,  credibility,  pertinence,  and  fitness  for  use. 

Even  at  this  level,  the  comprehensive  evaluation  of  I E.,  and  P is  performed 
by  the  aggregation  of  their  lower  components: 


I = ® (accuracy.,  precision.,  correctness.,  completeness.,  intelligibility.) 

E = ® (reliability.,  credibility.) 

P = ® (pertinence.,  fitness  for  use.) 

This  multi  criteria  assessment  of  GI  quality  depends  on  both  the  relevant  qual- 
ity indexes  Q.  (those  with  weight  K >0),  the  number  of  such  relevant  indexes, 
and  the  aggregation  operator  used  to  combine  them. 
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The  described  indices  and  the  progressive  levels  of  aggregation  are  repre- 
sented in  the  Figure  3a. 

In  order  to  clarify  how  the  model  applies  to  real  cases,  we  can  include  as  an 
example  a real  VGI  project  and  make  the  quality  indices  explicit.  Lets  assume 
to  work  as  analysts  in  the  famous  Wikimapia1  project,  and  try  to  estimate  the 
whole  quality  of  a VGI  item,  consisting  in  a polygonal  shape  with  an  annexed 
photo.  The  quality  index  associated  to  the  polygonal  feature  will  be  named  Q , 
and  the  one  associated  to  the  photo  will  be  Q2. 

We  choose  a sum  function  to  perform  the  aggregation  and  set  the  weights  for 
the  two  VGI  components  Q2  and  Q2,  and  for  the  three  quality  indices  I , E.  and 
P , depending  on  our  project  interests,  in  the  following  way: 

Kx  = 1,5  K2  = 1 assuming  more  interest  in  preserving  the  quality 

of  the  polygonal  feature  than  the  quality  of  the  photo; 
Kz  = 1 K£  = 1 Kp  = 0,5  assuming  more  interest  in  controlling  the  intrinsic 

and  extrinsic  quality  than  the  pragmatic  one. 

Now  we  set  some  numerical  values  to  the  elementary  quality  properties,  simu- 
lating a likely  situation  in  the  Wikimapia  project.  Let  us  define  the  numerical 
values  in  the  domain  [-1,0,  1]: 

we  set  the  feature  accuracy  and  the  feature  precision  = 0,  assuming,  in  this 
example,  that  is  not  possible  to  determine  them  directly  in  Wikimapia;2 

we  set  the  feature  correctness,  completeness  and  intelligibility  = 1,  assuming 
they  are  completely  fulfilled; 

we  set  reliability  = -1  and  credibility  = 0,  assuming  that  some  users  from 
the  Wikimapia  community  commented  negatively  the  entered  feature, 
and  assuming  the  author  is  a neophyte  (corresponding  to  user  level  0,  or 
Unregistered  in  Wikimapia); 

we  set  pertinence  and  fitness  for  use  = -1,  assuming  the  polygonal  feature 
entered  is  not  belonging  to  the  categories  requested  in  the  project  (for 
example  it  could  figure  out  the  area  in  which  a temporary  event  takes  place); 

we  set  similarly  the  values  for  the  elementary  quality  properties  of  the  photo- 
graphic component  of  the  VGI  item. 


1 http://wikimapia.org  is  a multilingual  open-content  collaborative  map,  where  volunteers  are 
asked  to  mark  places,  add  descriptions  provided  with  proof  links,  give  them  appropriate  cat- 
egories and  upload  photos. 

2 In  the  literature  some  procedures  have  been  developed  and  adopted  to  calculate  geometrical 
accuracy  and/or  precision  of  VGI  polygonal  contributions,  usually  based  on  a comparison 
with  the  base  map  images.  Nevertheless  these  procedures  can  be  sometimes  challenging  or 
not  applicable.  This  could  happen  for  various  reasons:  for  instance  the  user  generated  polygon 
could  refer  to  a physical  element  that  is  not  completely  visible,  or  not  updated,  in  the  base  map 
images;  sometimes  it  could  be  difficult  to  determine  which  data  is  more  accurate  (the  polygon 
from  the  volunteer  contributor  or  the  base  image  provided  by  the  map  application). 
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Figure  3:  The  theoretical  model  for  estimating  quality  in  a generic  GI  item  (a) 
and  its  enactment  for  a plausible  VGI  case  study  (b). 

The  situation  described  above  is  represented  in  Figure  3b,  as  homologous  to 
the  theoretical  model  in  Figure  3a.  The  numerical  score  resulting  from  the 
simulation  (Figure  3b)  is  a direct  consequence  of  the  initial  decision  - taken 
by  the  imaginary  analyst  - to  assign  equal  weights  to  the  intrinsic  and  to  the 
extrinsic  components  of  the  VGI  item,  and  it  is  directly  connected  with  the 
numerical  domain  stated  ([1,  0,  -1]).  These  choices  lead  to  a slightly  positive 
total  score  (0.5),  which  represents  the  overall  quality  index  for  the  volunteered 
contribution.  Some  alternative  decision  - for  instance  to  assign  a higher  value 
to  the  extrinsic  quality  weight  K£  - would  lead  to  different  and  even  negative 
results.  The  result,  whatever  it  would  be,  is  meaningful  only  if  compared  with 
analogous  ones,  belonging  to  the  same  dataset,  or  to  different  datasets.  The 
procedure  indeed  doesn’t  assess  itself  the  quality  of  the  VGI  items,  but  allows 
normalizing  the  quality  components  in  order  to  facilitate  their  comparison 
and  evaluation.  This  procedure  can  be  carried  out  manually,  automatically  or 
semi-automatically.  The  processed  items  could  be  finally  ranked  and  possibly 
filtered,  depending  on  the  accomplishment  of  a minimum  quality  threshold. 


Discussion  and  conclusion 

The  issue  faced  in  this  work  - i.e.  to  flexibly  represent  and  assess  the  quality  in 
crowdsourced  GI  - has  led  to  define  a methodology  and  a set  of  quality  indices 
suitable  to  represent  and  quantify  quality. 
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A synthetic  representation  of  quality  control  strategies  in  crowdsourced  GI 
activities  has  been  proposed,  as  a guide  for  project  design  and  comparison  of 
existing  ones. 

Such  approaches,  in  practical  cases,  are  often  combined  into  hybrid  strate- 
gies. To  break  them  down  into  their  atomic  properties  helps  to  describe  and 
normalize  such  strategies,  even  in  the  more  complex  operational  cases.  It  seems 
unrealistic  to  point  out  one  single  optimal  solution.  On  the  contrary,  several 
effective  configurations  can  coexist,  offering  suitable  solutions  for  specific  use 
cases.  The  choice  is  usually  determined  by  the  objective  of  the  crowdsourced 
GI  activity  (which  can  be  recreational,  social,  scientific,  professional,  experi- 
mental, etc.),  by  the  type  and  the  amount  of  information  expected  (images, 
annotations,  features,  measurements),  by  the  characteristics  of  the  contributors 
addressed  (citizens,  unaware  web  users,  trained  volunteers,  experts,  etc.),  and 
by  the  infrastructure  and  technologies  on  which  the  project  leans  (geographic 
databases,  web  and  mobile  clients  for  services,  sensors,  etc.). 

The  representation  introduced  in  Figure  2 can  be  used  not  only  a poste- 
riori, i.e.  to  describe  the  control  strategy  performed,  but  can  also  be  helpful 
during  the  design  phase,  to  configure  the  most  effective  solution  for  quality 
management. 

Besides  the  analysis  of  strategies  for  quality  control,  a simple  representation  of 
the  components  of  the  quality  in  crowdsourced  GI  has  been  depicted  too.  It  helps 
in  focusing  on  different  aspects  of  quality  and  individually  evaluating  them. 

A flexible  methodology  for  quality  assessment  has  been  introduced  on  the 
basis  of  the  given  representation.  It  can  be  applied  manually  or  automatically 
on  a wide  range  of  volunteered  contributions,  and  differently  weighted  accord- 
ing to  the  needs  of  the  analysts. 

The  estimation  of  the  quality  of  crowdsourced  GI  is  a challenge  that  has  been 
addressed  by  several  authors  with  different  methods.  The  methods  found  in  lit- 
erature, however,  are  usually  designed  to  respond  to  specific  needs,  and  there- 
fore, as  far  as  valuable  and  useful  in  particular  cases,  appear  to  suffer  from  one 
or  more  of  the  following  major  constraints: 

• they  aim  to  quantify  the  uncertainty  of  a single  quality  feature  (e.g.  the  cred- 
ibility of  the  volunteers,  or  the  accuracy  of  the  spatial  features,  etc.); 

• they  deal  with  a single  category  of  GI  (images,  annotations,  features, 
measurements); 

• they  are  suitable  only  for  a specific  application  (depending  on  a given  technol- 
ogy, or  presuming  the  participation  of  a certain  amount  of  volunteers,  etc.). 

The  contribution  brought  by  this  work  is  the  proposal  of  a generalized  method 
for  estimating  the  quality  that  goes  beyond  these  limitations.  It  is  formally 
defined  in  terms  sufficiently  operational  to  ensure  their  applicability,  but  also 
general  enough  to  ensure  its  transferability  to  different  application  cases. 
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The  proposed  method  is  based  on  some  choices  and  can  be  designed  as  a 
decision  making  approach  including  all  the  major  quality  features  and  allow- 
ing the  description  of  each  type  of  GI  contribution,  under  all  possible  aspects, 
using  different  aggregation  operators  to  get  to  a final  decision.  Its  strength  lies 
in  its  flexibility:  the  aggregation  operations  can  be  chosen  to  suit  the  purpose 
of  the  user’s  analysis,  and  the  decision-making  approach  allows  dealing  with 
any  specific  case.  Even  if  the  analyst  does  not  wish  to  join  to  the  proposed  rep- 
resentation of  quality  (Figure  1),  the  method  is  still  applicable,  by  replacing  the 
suggested  indices  with  alternative  properties.  Finally,  the  method  can  provide 
a guide  to  systematize  and  make  explicit  the  criteria  for  assessing  the  quality 
that  are  used  in  an  application  of  crowdsourced  or  even  heterogeneous  GI. 
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Abstract 

The  last  few  years  have  seen  the  emergence  of  a large  number  of  worldwide 
web  portals  where  volunteers  report  and  collect  observations  of  plants  and 
animals,  share  these  reports  with  other  users,  and  provide  data  for  scientific 
research  purposes  along  the  way.  Activities  engaging  citizens  in  the  collection 
of  scientific  data  or  in  solving  scientific  problems  are  collectively  called  citizen 
science.  Data  quality  is  a vital  issue  in  this  field.  Currently,  reports  of  species 
observations  from  citizen  scientists  are  often  validated  manually  by  experts 
as  a means  of  quality  control.  Experts  evaluate  the  plausibility  of  a report 
based  on  their  own  expertise  and  experience.  However,  a rapid  growth  in  the 
quantity  of  reports  to  be  processed  makes  this  approach  increasingly  less  fea- 
sible, creating  a need  for  methods  supporting  (semi) automatic  validation  of 
observation  data.  This  aim  is  achieved  primarily  by  analysing  the  spatial  and 
temporal  context  of  the  data.  Relevant  context  information  can  be  provided 
by  existing  observation  data,  as  well  as  by  spatial  data  of  environmental  fac- 
tors, or  other  spatio-temporal  factors  impacting  the  distribution  of  species,  or 
the  process  of  observation  and  contribution  itself.  It  is  very  important  that  the 
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specific  properties  of  data  emerging  from  citizen  science  origins  are  taken  into 
account.  These  data  are  often  not  produced  in  a systematic  way,  resulting  in  (for 
instance)  spatial  and  temporal  incompleteness.  Also,  the  data  structure  is  not 
only  determined  by  the  natural  spatio-temporal  patterns  of  species  distribu- 
tion, but  by  other  factors  such  as  the  behaviour  of  contributors  or  the  design  of 
the  citizen  science  project  that  produced  the  data  as  well. 


Keywords 
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Introduction 

Learning  more  about  biodiversity  on  our  planet  has  become  an  important 
challenge  as  we  face  climate  change  and  species  extinction.  Any  conservation 
efforts  need  to  be  based  on  adequate  knowledge  about  distribution,  behav- 
iour, and  ecology  of  species.  However,  the  long-term  data  covering  broad 
geographic  regions,  which  are  necessary  to  gain  said  knowledge  (Dickinson, 
Zuckerberg  Sc  Bonter  2010),  cannot  be  collected  using  professional  data  col- 
lectors alone.  High  costs  associated  with  professional  data  gathering  pose 
another  inhibiting  challenge.  One  way  of  solving  this  problem  is  data  collec- 
tion by  volunteers.  Activities  involving  citizens  in  the  collection  of  scientific 
data  or,  more  general,  in  scientific  research  endeavours,  are  called  citizen  sci- 
ence. While  citizen  science  itself  is  not  a new  phenomenon,  we  see  a growing 
number  of  such  projects  being  organized  in  web  portals,  revolutionising  the 
way  biodiversity  data  are  collected  and  made  available.  Recent  years  have  seen 
a growing  number  of  projects  using  the  possibilities  offered  by  web  2.0  tech- 
nologies (Dickinson,  Zuckerberg  Sc  Bonter  2010;  Miller- Rushing,  Primack  Sc 
Bonney  2012),  where  volunteers  can  upload,  manage,  and  share  their  own 
observations  of  plants  and  animals,  and  make  them  available  for  scientific 
research.  Opportunities  for  biodiversity  monitoring  and  ecological  research 
provided  by  this  phenomenon,  but  also  implications  for  project  organisa- 
tion and  management,  are  extensively  discussed  in  a book  by  Dickinson  and 
Bonney  (2012),  and  in  numerous  other  publications  (e.g.,  Connors,  Lei  Sc 
Kelly  2012;  Chandler  et  al.  2012;  Cosquer,  Raymond  Sc  Prevot-Julliard  2012; 
Sullivan  et  al.  2014).  Motivations  of  initiatives  in  this  field  range  from  further- 
ing public  interest  in  conservation  issues  and  concerns  (with  data  collection 
as  a mere  by-product),  to  systematic  generation  of  such  data  for  specific  uses 
in  scientific  research,  planning  or  public  administration  (e.g.  monitoring  of 
certain  species  or  groups  of  species  in  certain  areas  or  regions).  Other  projects 
aim  at  collection  of  data  about  the  distribution  of  species  without  a prede- 
fined, specific  goal. 
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This  way  of  collecting  data  using  the  www  and  the  general  public  is  a spe- 
cific form  of  crowdsourcing  (Howe  2006),  i.e.  employing  the  general  public  to 
produce  web  content  or  to  carry  out  certain  labour-intensive  tasks  (especially 
tasks  that  cannot  be  easily  automated  using  methods  of  data  processing).  Other 
terms  are  used  in  biodiversity  citizen  science  depending  on  data  collection  pro- 
cedures employed  or  goals  pursued,  such  as  community-based  monitoring  or 
CBM  (Conrad  & Hilchey  201 1).  As  the  data  collected  always  have  a geographic 
reference,  they  represent  a specific  type  of  Volunteered  Geographic  Informa- 
tion (VGI)  (Goodchild  2007;  Haklay  2013). 

One  of  the  most  important  concerns  with  these  data  is  data  quality.  Assur- 
ing data  quality  is  important  because  a general  lack  of  trust  will  decrease  their 
use  for  science  or  administration  (Conrad  & Hilchey  2011).  A recent  study  by 
Theobald  et  al.  (2015)  showed  that  so  far  only  a small  portion  of  biodiversity- 
related  citizen  science  projects  contributed  data  to  peer-reviewed  scientific 
articles.  The  quality  of  the  output  of  scientific  research  depends  directly  on  the 
quality  of  the  data  used  (Dickinson,  Zuckerberg  & Bonter  2010),  as  does  the 
quality  of  administrative  and  planning  decisions.  On  the  other  hand,  citizen 
science  approaches  introduce  great  advantages,  considering  their  ability  to  pro- 
vide large  amounts  of  data  over  broad  geographic  regions  as  well  as  long  peri- 
ods of  time,  often  at  relatively  low  cost  (Dickinson,  Zuckerberg  & Bonter  2010). 
At  the  same  time,  this  poses  a challenge  for  data  quality  assurance:  many  pro- 
jects acquire  large  amounts  of  observations  - often  hundreds  of  observations 
per  day,  or  even  more.  Many  projects  employ  manual  validation  procedures 
that  do  not  scale  well,  making  (semi)automatic  validation  methods  necessary. 

This  chapter  presents  an  overview  of  important  issues  related  to  quality  of  cit- 
izen science  biodiversity  data.  Using  examples  from  citizen  science  projects  in 
the  domain  of  biodiversity,  it  discusses  specific  problems  and  possible  avenues 
to  solutions  concerning  quality  assurance  for  this  specific  kind  of  VGI.  While 
there  are  many  commonalities  with  VGI  from  other  domains,  allowing  for  the 
adoption  of  quality  assurance  approaches  and  strategies  that  are  also  used  in 
other  fields  of  VGI,  there  are  also  notable  differences  or  features  shared  only 
with  few  other  VGI  domains,  making  adjustments  of  common  approaches  and 
strategies  necessary.  Most  important  among  these  differences  are  the  diversity 
concerning  project  design  and  organisation  (from  strict  monitoring  schemes 
to  rather  open,  opportunistic  data  collection,  resulting  in  data  properties  and 
quality  assurance  needs  varying  between  projects),  and  the  nature  of  the  infor- 
mation mapped  (identification  of  species  requiring  some  degree  of  expert 
knowledge,  thereby  raising  issues  of  credibility). 


Quality  of  citizen  science  biodiversity  data 

When  we  examine  the  quality  of  citizen  science  data  from  the  biodiversity 
domain,  we  need  to  look  at  how  data  quality  can  be  defined,  and  how  it  is  used 
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and  handled  in  the  relevant  scientific  practice.  We  approach  the  term  data  qual- 
ity from  two  different  perspectives: 

• Data  quality  in  terms  of  the  sum  of  the  data’s  properties,  and 

• data  quality  in  terms  of  the  data’s  fitness  for  use. 


Data  quality  as  the  sum  of  the  data's  properties 

Observations  of  occurrences  of  species  are  geographic  data.  Therefore,  their 
quality  in  terms  of  characteristic  properties  can  be  described  using  the  qual- 
ity features  introduced  by  ISO  standard  ISO  19113  (ISO  2002).  These  include 
the  following:  completeness,  logical  consistency,  positional  accuracy,  temporal 
accuracy,  and  thematic  accuracy.  Properties  of  the  data  regarding  these  attrib- 
utes are  determined  mostly  by  the  design  of  the  project  collecting  the  data 
(especially  rules  and  guidelines  concerning  data  collection).  Therefore,  they  are 
diverse.  Considering  the  aspect  of  completeness,  we  often  find  a pronounced 
spatial  and  temporal  heterogeneity  in  citizen  science  biodiversity  data.  This 
is  especially  the  case  for  data  collected  without  using  structured  monitoring 
schemes  or  strict  rules  - so  called  casual  data  collected  in  an  opportunistic  way 
(Chapman  2005).  There  are  many  reasons  for  this  heterogeneity,  like  contribu- 
tor preference  for  certain  species  or  groups  of  species,  variable  observation 
effort  caused  by  different  (seasonal)  weather  conditions,  or  differences  in  spa- 
tial density  of  observations  associated  with  differences  in  population  density, 
among  many  others.  Bird  et  al.  (2014)  describe  approaches  to  account  for  the 
variability  in  the  resulting  data  caused  by  such  factors.  They  use  several  statis- 
tical tools  to  demonstrate  effects  of  certain  types  of  error  and  bias  in  citizen 
science  data  on  modelling  results  in  biology,  and  describe  how  to  address  these 
issues.  Van  Strien,  van  Swaay  and  Termaat  (2013)  present  a methodology  to 
remedy  several  types  of  bias  in  the  data  when  using  them  for  occupancy  models 
(modelling  the  distribution  of  species  in  space  and  time). 

The  positional  accuracy  of  citizen  science  species  distribution  data  depends 
primarily  on  the  type  of  location  information,  e.g.  exact  point,  assignment  to 
an  (arbitrary)  area  or  to  a map  quadrant.  The  data  of  many  relevant  projects  are 
heterogeneous  in  this  respect.  The  positional  accuracy  of  point  data  depends 
(among  other  factors)  on  the  way  the  coordinates  of  an  observation  are  deter- 
mined, e.g.  using  a GPS  device  on  site,  placing  the  observation’s  location  on  a 
map  or  aerial  photograph  (in  a map  viewer),  deriving  the  location  from  a speci- 
men description,  etc. 

Thematic  accuracy  refers  to  the  correctness  of  the  classification  of  objects  or 
of  their  non-quantitative  attributes  (Kresse  & Fadaie  2004).  An  important  issue 
regarding  thematic  accuracy  of  observational  data  of  animals  and  plants  from 
citizen  science  projects  is  the  participants’  lack  of  scientific  training  and  its  effect 
on  the  reliability  or  credibility  of  species  identification  (Conrad  & Hilchey  2011). 
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The  temporal  accuracy  of  observational  data  of  animals  and  plants  from  citi- 
zen science  projects  is  determined  by  how  accurately  it  can  be  determined  at 
data  collection.  The  day  of  observation  is  mandatory  information  provided  in 
most  cases.  Sometimes,  the  time  of  day  can  be  specified  as  well,  or  is  recorded 
automatically  if  an  observation  is  reported  using  a mobile  device.  Currentness, 
i.e.  the  correctness  of  data  in  relation  to  the  state  of  the  environment  changing 
over  time,  is  another  important  aspect  of  temporal  accuracy. 

Logical  consistency,  including  aspects  like  consistency  of  data  structure  or 
compliance  with  certain  rules  (Kresse  & Fadaie  2004)  is  usually  ensured  by 
adequate  design  of  the  reporting  tools  and  data  base. 


Data  quality  in  terms  of  fitness  for  use 

Data  quality  in  terms  of  usefulness  can  only  be  assessed  for  a certain  intended 
use  of  the  data  (Devillers  et  al.  2007).  Whether  data  quality  is  good  enough’  for 
a specific  use  depends  on  whether  the  data’s  properties  allow  for  the  question (s) 
at  hand  to  be  answered  (Devictor,  Whittaker  & Beltrame  2010).  For  example,  a 
precise  location  in  observation  data  of  plants  or  animals  is  not  important  if  the 
data  are  used  for  deriving  seasonal  occurrence  for  larger  regions,  but  would  be 
important  for  analysing  fine-grained  spatial  distribution  patterns.  Bordogna  et 
al.  (2014)  point  to  the  need  for  all  VGI  to  assess  and  improve  data  quality  with 
respect  to  the  data’s  intended  use  and  the  data  user’s  expectations.  They  propose 
a framework  to  match  users’  needs  and  data  properties. 


Principles  of  quality  assurance  for  user-generated  data 

Data  quality  assurance  aims  at  identifying,  correcting  and  eliminating  errors. 
Chapman  (2005)  also  uses  the  term  ‘data  cleaning’.  On  the  one  hand,  this  pro- 
cess includes  the  identification  of  formal  errors,  i.e.  missing  values,  typing 
errors,  etc.  On  the  other  hand,  the  suitability  of  a (formally  correct)  data  set  for 
a particular  purpose  depends,  as  we  have  already  seen,  on  whether  the  data’s 
characteristics  (e.g.  position  accuracy)  are  sufficient  for  this  purpose.  Such  uses 
can  be  very  diverse  and  are  often  not  fully  foreseen  prior  to  data  collection 
(Dickinson,  Zuckerberg  & Bonter  2010). 

Goodchild  and  Li  (2012)  identify  three  basic  approaches  to  quality  assurance 
for  VGI,  which  are  also  applicable  to  citizen  science  observation  data  in  the 
field  of  biodiversity. 

The  ‘crowd-sourcing  approach’  builds  on  the  assumption  that  an  error  can- 
not persist  if  many  users  work  on  the  same  data.  Hardisty  and  Roberts  (2013) 
consider  this  the  best  method  to  identify  errors  in  biodiversity  data.  Good- 
child  and  Li  (2012),  however,  present  a good  example  where  this  assumption 
failed,  with  a wrong  name  of  a golf  course  in  California  persisting  for  years  in 
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Wikimapia  (an  online  map  project  collecting  information  on  locations  from 
users).  They  also  conclude  that  what  they  call  obscure  objects  (e.g.  objects  that 
exist  only  for  short  periods  of  time)  may  be  more  susceptible  to  such  errors 
than  others.  Observations,  especially  of  more  mobile  animal  species,  may  well 
be  counted  among  these. 

Another  principle,  termed  the  ‘social  approach’  by  Goodchild  and  Li  (2012), 
uses  privileged  users  as  controllers  validating  the  data  collected  in  the  project. 
This  approach  is  widely  used  in  citizen  science  projects  in  the  biodiversity 
domain  (Wiggins  et  al.  2011).  Data  validators  are  often  regional  experts  for  a 
certain  species  group  (Sullivan  et  al.  2009),  responsible  for  data  validation  in 
a certain  area  that  they  know  well.  The  validation  process  sometimes  involves 
communication  between  data  reviewers  and  observers,  when  a reviewer 
requests  more  specific  information  about  an  unusual  report  (Bonter  & Cooper 
2012)  that  may  help  to  validate  it. 

In  the  ‘geographic  approach’,  Goodchild  and  Li  (2012)  summarize  all  meth- 
ods using  rules  formalising  geographic  context.  As  Elwood,  Goodchild  and  Sui 
(2012:  580)  conclude,  ‘. . . the  richness  of  geographic  context  (...)  makes  it  com- 
paratively difficult  to  falsify  VGI,  either  accidentally  or  deliberately’.  Methods 
based  on  this  principle  allow  for  automatic  verification  of  data.  The  necessary 
geographic  context  can  be  gained  from  observation  data  already  existing  in  the 
project  in  question.  This  approach  requires  large  amounts  of  existing  data  with 
a relatively  high  spatial  density  (Conrad  & Hilchey  2011),  often  not  (or  not  yet) 
available  in  citizen  science  data  sets  in  the  biodiversity  domain.  Consequently 
there  is  a need  for  methods  relying  on  other  context  sources.  Using  external 
context  data  may  provide  a solution  to  this  challenge  (Elwood,  Goodchild  & 
Sui  2012),  adding  the  question  of  data  quality  of  these  context  data  to  the  pic- 
ture. Goodchild  and  Li  (2012)  conclude  that  there  is  a need  for  the  formaliza- 
tion of  relevant  geographic  context  and  the  rules  for  describing  it. 

Using  geographic  context  with  distribution  data  of  organisms  shows  cer- 
tain methodological  similarities  with  niche  or  habitat  modelling,  using  known 
occurrences  or  absences  of  a species  or  of  species  communities  in  order  to 
find  correlations  between  these  occurrences  and  a number  of  environmental 
factors,  with  the  goal  of  predicting  occurrences  (or,  at  least,  finding  suitable 
habitats)  in  regions  without  available  occurrence  data  (Engler  et  al.  2004). 
Many  niche  modelling  methods  need  absence  data  (that  is,  data  about  loca- 
tions where  the  species  in  question  is  definitely  not  present)  to  work  (Engler, 
Guisan  & Rechsteiner  2004).  However,  the  inability  to  provide  absence  data  is 
a notorious  weakness  of  citizen  science  data  in  the  biodiversity  domain,  espe- 
cially if  collected  as  casual  data  in  an  opportunistic  way  (Chapman  2005).  This 
disadvantage  can  be  overcome  (or  at  least  mitigated)  by  using  an  appropri- 
ate project  design  concerning  the  protocols  and  procedures  to  be  followed  at 
observation  data  collection.  A well-established  approach  is  the  use  of  species 
checklists,  allowing  to  differentiate  between  species  that  were  observed  at  a cer- 
tain place  and  time  and  species  that  were  not  (for  example,  the  project  eBird 
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or  the  German  ornitho.de  platform  use  this  method).  Certain  issues  like  the 
detectability  of  species  still  need  to  be  taken  into  account  when  working  with 
this  approach. 


Quality  assurance  for  user-generated  data  from  citizen 
science  projects:  research  and  practice,  shortcoming 
and  possible  solutions 

Wiggins  et  al.  (2011)  conducted  a study  analysing  the  quality  assurance  mecha- 
nisms used  in  citizen  science  projects.  They  found  that  many  projects  assure 
the  quality  of  the  data  produced  by  implementing  suitable  measures  before 
data  collection  (e.g.  project  design,  training  of  participants,  etc.),  while  manual 
validation  of  observation  data  by  experts  is  the  dominant  approach  for  ex  post 
verification  of  data.  The  assessment  of  correctness  (or  ‘truth’)  of  an  observation 
is  based  on  the  plausibility  of  that  observation  in  the  light  of  the  information 
provided  with  the  observation.  The  expert’s  knowledge  about  the  species  and 
the  region  the  observation  comes  from  serve  as  reference  information  for  the 
assessment.  Also,  photographs  are  often  used  as  evidence. 

Some  projects  employ  automatic  assessments  of  the  plausibility  of  observa- 
tions. For  instance,  the  project  eBird,  considered  as  a gold  standard’  source  for 
bird  observations  from  citizen  scientists  for  use  in  scientific  research,  checks 
the  numbers  of  individuals  of  species  specified  by  the  observer  for  plausibil- 
ity, taking  into  account  the  location  and  the  season  (Sullivan  et  al.  2009).  If 
the  numbers  are  considered  implausible,  the  observer  gets  feedback  right  away. 
If  he  or  she  insists,  the  observation  is  passed  on  to  a regional  expert  for  val- 
idation. This  is  also  the  case  for  observations  that  contain  species  not  listed 
in  the  species  checklist  provided  to  the  observer  for  the  location  and  season 
(observers  can  manually  add  species  to  the  list).  eBird  now  also  uses  the  large 
amount  of  data  already  accumulated  in  the  project  to  determine  parameters 
for  its  filter  mechanisms,  improving  filtering  results  concerning  unusual  obser- 
vations (Sullivan  et  al.  2014).  In  the  German  portal  ‘naturgucker’,  observers 
get  hints  from  the  system  if  an  observation  has  certain  properties  making  it 
implausible.  For  example,  the  system  checks  whether  the  reported  species  usu- 
ally occurs  in  the  region  and  at  the  time  the  observation  was  made.  Another 
filter  checks  whether  the  species  has  been  reported  from  that  region  before. 
Reports  of  uncommonly  rare  species  will  also  lead  to  appropriate  feedback  to 
the  observer.  This  project  does  not  flag  reports  or  pass  them  on  to  experts  for 
verification,  leaving  further  data  quality  control  entirely  up  to  the  crowd.  Pro- 
ject Feeder  Watch,  a North  American  bird  monitoring  program,  has  automatic 
filters  very  similar  to  those  of  the  project  eBird,  as  well  using  species  check 
lists  for  regions  and  seasons,  and  numbers  of  individuals  observed.  Bonter  and 
Cooper  (2012)  point  to  the  inability  of  such  filters  to  detect  plausible  but  false 
reports,  and  see  a need  for  more  research  in  this  area.  They  expect  advances 


82  European  Handbook  of  Crowdsourced  Geographic  Information 


through  combining  different  approaches  for  plausibility  assessment,  including 
assessment  of  the  observers’  expertise  or  experience.  Concerning  contributors 
and  their  properties,  Schlieder  and  Yanenko  (2010)  explored  approaches  using 
social  distance  between  contributors  as  a confirming  factor  for  the  reliability  of 
VGI  contributions  closely  related  in  space  and  time.  However,  such  concepts 
are  hardly  applicable  for  citizen  science  data  from  the  biodiversity  domain,  as 
suitable  information  about  contributors  to  measure  their  social  distance  is  very 
rarely  available.  For  an  overview  of  the  data  quality  assurance  strategies  in  pro- 
jects mentioned  in  this  section,  see  Table  1. 

Many  citizen  science  projects  in  the  field  of  biodiversity  collect  observations 
of  plants,  animals  and  fungi  in  an  opportunistic  way,  producing  so  called  cas- 
ual data  without  imposing  strict  rules  or  protocols  on  the  contributors.  Vol- 
unteers contributing  to  such  projects  are  free  to  collect  and  submit  observa- 
tions of  a large  number  of  different  species  at  any  time  and  from  any  place 
(examples  are  the  Swedish  Artportalen  project  and  iNaturalist,  an  American 
project  with  a world- wide  scope;  see  Table  1 for  an  overview  of  their  respective 
data  quality  assurance  strategies).  This  approach  has  the  potential  of  producing 
large  amounts  of  data,  as  the  effort  required  from  volunteers  is  relatively  low, 


Project 

Data  quality  assurance  strategies  and 
options,  in  terms  used  by 

Goodchild  and 
Li  (2012) 

Wiggins  et  al.  (2011) 

eBird 

(http://ebird.org) 

Social  approach 

Filtering  of  unusual  reports, 
contacting  participants  about 
unusual  reports,  expert  review 

Project  Feeder  Watch 

(http://feederwatch.org) 

Social  approach 

Filtering  of  unusual  reports, 
contacting  participants  about 
unusual  reports,  expert  review 

Ornitho.de 

(http://www.ornitho.de) 

Social  approach 

Contacting  participants  about 
unusual  reports,  expert  review 

naturgucker 

(http :/ / www.naturgucker.de) 

Crowd- sourcing 
approach 

Filtering  of  unusual  reports 

Artportalen 

(http://www.artportalen.se) 

Social  approach 

Expert  review 

iNaturalist 

(http://www.inaturalist.org) 

Crowd- sourcing 
approach 

Filtering  of  unusual  reports 

Table  1:  Data  quality  assurance  strategies  and  options  employed  by  the  citi- 
zen science  projects  cited  in  this  chapter,  in  terms  used  by  Goodchild  and  Li 
(2012)  and  Wiggins  et  al.  (2011),  respectively.  Information  about  the  projects’ 
data  quality  assurance  strategies  can  be  found  on  their  web  sites  (see  table). 
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encouraging  participation  and  thus  furthering  high  numbers  of  participants. 
However,  this  kind  of  data  has  increased  needs  for  ex  post  quality  assurance 
and  suitable  data  quality  parameters,  because  the  usefulness  of  such  projects 
and  their  data  for  science,  administration,  and  planning  is  often  questioned 
due  to  a lack  of  ex  ante  quality  assurance  measures  (e.g.  training  of  volunteers, 
implementation  of  monitoring  schemes,  etc.). 

Most  observations  consist  of  at  least  the  species,  location,  time,  and  observer, 
sometimes  supplemented  with  more  (project-specific)  information.  Therefore, 
methods  for  quality  assurance  or  plausibility  assessment  needing  only  the  four 
basic  aspects  of  an  observation  have  the  potential  to  be  useful  for  many  differ- 
ent projects  and  data  sets,  but  data  properties  have  to  be  carefully  examined  in 
any  case.  For  example,  a seemingly  exact  location  in  the  form  of  coordinates 
can  have  a wide  range  of  accuracy,  or  even  represent  different  types  of  locations 
(i.e.  an  exact  location  vs.  the  centre  of  a map  quadrant). 

Conclusion 

The  scientific  studies  cited  in  this  chapter,  as  well  as  the  examples  given,  provide 
an  overview  of  the  most  important  aspects  of  quality  of  citizen  science  data 
from  the  biodiversity  domain  and  its  assurance.  They  show  that  manual  valida- 
tion of  observations  of  species  by  experts  based  on  an  assessment  of  their  plau- 
sibility in  the  light  of  available  context  information  is  the  dominant  approach 
in  citizen  science  projects  in  the  biodiversity  domain.  The  use  of  automatic 
(or  semi-automatic)  approaches  for  plausibility  assessment  is  increasing,  yet 
they  have  important  shortcomings  as  described  in  section  3.  Employing  the 
geographic  context  for  plausibility  assessment  of  crowd-sourced  geographic 
data  has  high  potential  for  assessing  the  plausibility  of  species  observations  in  a 
(semi)automatic  way,  despite  being  rarely  used  so  far.  There  is  a great  need  for 
further  research  on  methods  to  assess  the  plausibility  of  citizen  science  data  in 
the  biodiversity  domain  taking  their  specific  properties  into  account. 
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Abstract 

Vast  swaths  of  geographic  information  are  produced  by  non-professional  con- 
tributors using  online  collaborative  tools.  To  extract  value  from  the  data,  crea- 
tors and  consumers  alike  need  some  degree  of  consensus  about  what  the  entities 
of  their  domain  of  interest  are  and  how  they  are  related.  Traditional  informa- 
tion communities,  such  as  government  agencies,  universities,  and  corporations, 
have  devised  informal  and  formal  mechanisms  to  reduce  the  misinterpretation 
of  the  data  they  rely  on,  curating  vocabularies,  standards,  and,  more  recently, 
formal  ontologies.  Because  of  the  decentralized,  fragmented  nature  of  peer  pro- 
duction, semantic  agreements  are  more  difficult  to  establish  and  to  document 
in  volunteered  geographic  information  (VGI),  severely  limiting  the  re-usability 
and,  ultimately,  the  value  of  the  data.  This  paper  provides  an  overview  of  the 
semantic  issues  experienced  in  VGI,  and  what  potential  solutions  are  emerg- 
ing from  research  in  geo-semantics  and  in  the  Semantic  Web.  The  paradigm  of 
Linked  Data  is  discussed  as  a promising  route  to  handle  the  semantic  fragmen- 
tation of  VGI,  reducing  the  friction  between  data  producers  and  consumers. 
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Introduction 

The  production  of  geographic  information  was,  until  a decade  ago,  the  exclu- 
sive territory  of  professional  surveyors  and  cartographers,  working  for  gov- 
ernments and  private  firms.  The  combination  of  increasingly  powerful,  cheap, 
portable,  and  interconnected  computers  has  opened  up  unforeseen  possibilities 
for  data  collection  and  sharing  beyond  professional  circles,  with  already  tangi- 
ble effects  (Dodge  & Kitchin,  2013).  These  non-professional  mappers  and  car- 
tographers carry  out  their  efforts  on  online  platforms,  producing  digital  arti- 
facts, such  as  maps,  geo-databases,  and  gazetteers.  This  process  can  be  seen  as 
a form  of  collective  communication  about  some  phenomenon  of  interest  (e.g. 
tourist  attractions,  noise  pollution,  or  animal  behavior).  The  communication  is 
mediated  by  machines  through  the  encoding  of  knowledge  from  human  minds 
into  data  and  the  decoding  of  data  back  to  knowledge. 

To  be  able  to  perform  this  process,  the  communities  that  produce  volun- 
teered geographic  information  (VGI)  need  to  devise  a shared  conceptualization 
of  the  portion  of  the  world  they  want  to  capture.  Questions  about  what  entities 
exist,  what  their  attributes  are,  what  relationships  they  have  to  each  other,  need 
answers  with  some  degree  of  consensus.  For  a myriad  of  reasons,  such  a con- 
sensus is  often  hard  to  reach.  The  world  and  its  constituents  can  be  described 
according  to  many  different,  and  equally  valid,  conceptualizations  (Smith  & 
Mark,  2001).  This  problem,  rooted  in  human  cognition  and  communication,  is 
often  called  semantic  heterogeneity’,  and  is  observable  in  the  ubiquitous  vague- 
ness, synonymy,  and  polysemy  in  natural  languages. 

In  this  chapter,  I provide  an  overview  of  the  semantic  challenges  that  VGI 
producers  and  consumers  face  when  describing  and  interpreting  their  data.  I 
cover  this  issue  from  the  perspective  of  geo-semantics,  the  discipline  at  the 
intersection  of  geographic  information  science,  computer  science,  and  knowl- 
edge engineering.  First,  I cover  relevant  work  in  semantics  in  the  context  of 
geography.  Subsequently,  I focus  on  the  specific  context  of  VGI,  and  its  pecu- 
liar challenges.  As  a case  study,  I consider  OpenStreetMap  and  its  community. 
Finally,  from  a more  technological  viewpoint,  I discuss  the  emergent  Linked 
Data  ecosystem,  assessing  its  promises  for  more  transparent,  participatory,  and 
democratic  geographic  information  commons. 


Semantics  and  geographic  information 

Geography  is  pervasive  in  human  experience  and  natural  language.  On  a daily 
basis,  we  navigate  in  and  communicate  about  the  geographic  world,  referring 
to  natural  and  man-made  entities  such  as  roads,  cities,  mountains,  and  rivers. 
Our  intuitive  understanding  of  such  concepts  conceals  the  complexity  that  is 
encountered  when  trying  to  encode  them  in  a digital  form.  The  term  moun- 
tain, for  example,  has  a common-sense  meaning,  but  also  possesses  dozens  of 
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local  and  specialized  definitions  around  the  world  (Janowicz  et  al.  2013).  When 
an  information  system  needs  to  answer  the  question  ‘Where  is  Mount  Ever- 
est?’ there  is  no  single,  context-free,  cross-cultural  way  to  produce  an  answer 
that  will  satisfy  all  users.  The  same  consideration  can  be  applied  to  virtually  all 
natural  geographic  features,  whose  boundaries  are  vague,  seasonal,  or  grad- 
ual. Man-made  features,  while  obviously  exhibiting  more  intelligible  and  crisp 
organization,  are  not  exempt  from  heterogeneity,  and  can  be  described,  catego- 
rized, and  aggregated  through  alternative  and  incompatible  conceptualizations. 

Geo-semantics,  as  a subfield  of  geographic  information  science,  is  concerned 
with  providing  theoretical  and  applied  means  to  handle  the  variations  in  these 
concepts,  with  the  purpose  of  facilitating  the  creation  and  processing  of  infor- 
mation in  computationally  tractable  terms.  Standardization  of  units  of  meas- 
urement, nomenclatures,  and  other  aspects  of  the  geographic  domain  is  indeed 
an  important  way  to  reduce  semantic  friction,  and  has  been  successfully  applied 
to  many  domains,  such  as  the  CORINE  nomenclature  for  land  use.  However,  as 
Janowicz  et  al.  (2013)  argue,  geo-semantics  is  not  about  imposing  standards  for 
what  we  mean  by  ‘mountain,  but  should  be  rather  about  providing  ways  to  pre- 
serve and  handle  the  local  definitions  across  heterogeneous  datasets,  enabling 
precise  translation  mechanisms.  These  vague  geographic  concepts  are  hard  to 
formalize,  and  their  intrinsic  cultural  grounding  makes  them  poor  candidates 
for  long-term  universal  standardization. 

One  avenue  of  research  in  geo-semantics  focuses  on  ontology  engineering  in 
support  of  conceptual  modeling  in  geographic  contexts  (Kuhn  2009).  Unlike 
‘big-o  Ontology’,  a branch  of  Western  philosophy  interested  in  the  deep  struc- 
ture of  reality  and  being,  ontology  engineering  does  not  aim  at  assessing  what 
actually  exists  in  the  world  outside  the  human  mind,  but  has  the  task  of  con- 
straining the  usage  of  terms  in  the  data  towards  the  meaning  intended  by  their 
authors.  The  underlying  intuition  lies  in  the  usage  of  formal  semantics,  such  as 
first  order  or  description  logics,  to  provide  machine-readable,  less  ambiguous 
descriptions  of  entities  and  their  relationships,  which  can  be  used  to  support 
data  sharing,  integration,  and  constrained  forms  of  reasoning.  Insights  from  this 
arena  include  the  formal  clarification  of  identity , rigidity , role , is-a , and  part-of 
relationships,  which  wreak  havoc  when  misunderstood  in  complex  information 
systems  (Guarino  2009).  This  program  bears  similarities  with  traditional  forms 
of  Artificial  Intelligence,  with  which  it  shares  the  formal  approach,  but  differs 
substantially  in  that  it  lacks  the  ambition  to  model  common-sense  knowledge 
through  logic,  aiming  for  more  realistic  and  pragmatic  purposes,  such  as  the 
handling  the  meanings  of  ‘mountain  in  the  Himalayas  and  in  Ireland. 


VGI  and  meaning 

When  even  well-funded  scientific  and  corporate  organizations  struggle  to  han- 
dle semantic  heterogeneity,  it  should  come  as  no  surprise  that  VGI  contributors 
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and  consumers  encounter  substantial  semantic  problems  in  their  work.  In 
online  collaborative  projects,  tensions  between  alternative  conceptualizations 
of  the  same  portions  of  reality  are  common,  as  a quick  exploration  of  Wiki- 
pedia’s talk  pages  would  reveal.  If  they  want  to  create  value,  VGI  contributors 
interested  in  cycling  are  forced  to  confront,  sooner  or  later,  what  exactly  they 
mean  by  cycle  way’,  cycle  lane’,  and  road  quality’,  and  whether  these  concepts 
fit  different  national  and  regional  contexts  different  to  their  own. 

Taking  a broad  stance  on  the  scope  of  VGI,  its  semantic  structures  vary 
from  well-defined  and  curated  geo -referenced  datasets,  such  as  GeoNames,  to 
unstructured  social  media  content  and  blogs.  The  former  are  geo -semantically 
explicit,  having  the  purpose  of  covering  the  entire  world  systematically.  By  con- 
trast, the  latter  contain  large  amounts  of  geographic  information-mainly  vague 
references  to  place  names.  Much  VGI  semantics  lies  between  these  two  extremes, 
for  example  in  the  case  of  folksonomies  and  centralized  tagging  platforms,  such 
as  Tagzania,  WikiMapia,  and  Flickr.  Such  semantic  approaches  consist  usually 
of  a combination  of  a top-down,  centralized  definition  of  a conceptualization  by 
a small  elite,  and  the  emergent  semantics  of  bottom-up,  unrestrained  tagging. 

The  most  popular  VGI  project,  OpenStreetMap  (OSM),  deserves  sepa- 
rate treatment.  This  cartographic  project  is  geographically  explicit,  and  pro- 
duces data  substantially  more  complex  than  that  of  competing  efforts  such  as 
WikiMapia.  The  main  dataset  of  the  project  contains  an  uneven  (but  impres- 
sive) object-based  description  of  the  entire  planet,  including  its  roads,  build- 
ings, parks,  forests,  lakes,  etc.  The  conceptualization  underlying  this  data  is  a 
semi- structured  folksonomy,  documented  on  a wiki  website,1  permitting  the 
creation  of  any  new  term  deemed  necessary  by  users.  For  example,  the  term 
amenity -university  is  used  to  tag  universities.  Rather  than  a fixed  ontology, 
OSM’s  conceptualization  is  a transient,  evolving  product,  open  to  modifica- 
tion and  negotiation.  The  project  experiences  therefore  a tension  between  the 
technical  need  for  a stable  conceptualization,  and  the  desire  of  contributors  to 
express  their  local  knowledge  without  a top-down  interpretation  of  their  world 
being  imposed  upon  them. 

Because  of  its  openness,  OSM  is  an  ideal  resource  to  study  the  semantic  dimen- 
sion of  crowdsourced  cartography.  Using  a combination  of  media,  including  a 
wiki  website,  forums,  mailing  lists,  and  software  tools,  contributors  negotiate 
the  conceptualization  that  underpins  the  data  they  produce,  often  disagreeing 
(Ballatore  2014).  By  exploring  this  digital  corpus,  it  is  possible  to  probe  the  inter- 
connected dimensions  of  the  largely  asynchronous  negotiation  performed  by 
VGI  contributors.  Most  of  the  observable  negotiation  in  OSM  revolves  around 
ontology  engineering,  i.e.  the  extraction  of  an  explicit  conceptualization  from 
tacit  knowledge,  but  in  an  informal,  online  setting  (Ballatore  & Mooney  2015). 
The  dimensions  of  this  negotiation  can  be  summarized  as  follows: 


http : / / wiki,  openstreetmap.  org. 
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Topology  and  mereology.  A topology  is  needed  to  represent  geographic 
entities,  defining  how  entities  can  be  connected  or  contiguous,  grounded 
in  a theory  of  boundaries,  interiority/exteriority,  and  separation.  Addi- 
tionally, a mereology  is  necessary  to  encode  complex  spatial  entities, 
specifying  how  the  parts  relate  to  wholes,  for  example  in  the  case  of  large 
buildings. 

Simplification  and  adaptation.  Many  domains,  such  as  land  cover,  road 
classification,  and  traffic  regulations,  have  been  conceptualized  in 
national  and  international  contexts.  However,  such  conceptualizations 
are  often  too  complex  for  the  scope  of  OSM  and  are  filled  with  technical 
terminology.  In  these  cases,  contributors  choose  an  appropriate  subset 
of  the  conceptualization,  and  adapt  it  to  suit  their  needs.  For  instance, 
in  France,  the  European  CORINE  Land  Cover  nomenclature  has  been 
imported  into  OSM. 

Universalism  and  localism.  A fundamental  tension  arises  between  the  desire 
to  develop  a universal  conceptualization  that  will  be  applied  all  over  the 
world,  and  the  need  to  tap  into  the  heterogeneous  and  local  knowledge  of 
contributors.  Initially,  contributors  attempted  an  Anglo -centric  universal 
conceptualization,  and  subsequently,  facing  an  explosion  of  complexity 
and  spatial  variation,  fragmented  it  into  regional  or  national  schemas. 
Notably,  the  classification  of  roads  has  been  problematic  since  the  incep- 
tion of  the  project,  even  within  the  English-speaking  world.  Similarly, 
contributors  struggle  with  the  complexity  of  the  national  road  legislation, 
resorting  to  translatable  national  schemas. 

Problems  of  equivalence.  The  tension  between  universalism  and  localism 
results  in  problems  of  equivalence  between  languages,  for  example  in 
the  conceptualization  of  restaurants  in  different  countries.  As  indicated 
by  linguistic  translation  theory,  contributors  need  to  express  local  con- 
cepts that  do  not  have  a direct  translation  in  English,  such  as  concepts 
that  depend  on  local  practices,  laws,  and  vocabularies  (e.g.  courthouse). 
Specific  local  entities  are  often  described  into  more  general  English  terms, 
losing  potentially  more  precise  local  knowledge. 

Contested  definitions.  To  constrain  the  intrinsic  vagueness  of  geographic 
terms,  lexical  definitions  of  terms  provide  an  important  normative  tool 
to  construct  a shared  conceptualization.  As  in  other  domains,  lexical  def- 
initions can  help  constrain  the  intended  usage  of  terms,  specifying  the 
necessary  and  sufficient  conditions  for  their  application.  Unsurprisingly, 
conflicts  frequently  arise  about  the  lexical  definitions  in  OSM.  Problems 
occur  when  definitions  are  underspecified,  lacking  necessary  detail,  and 
when  they  are  overspecified,  including  irrelevant  or  confusing  details. 
Definitional  conflicts  result  in  classification  conflicts  in  the  data,  when 
contributors  disagree  on  whether  individuals  fit  a category  or  not  (e.g.  is  a 
building  a church,  a chapel,  or  a cathedral?). 
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Conceptual  granularity.  Information  can  be  expressed  at  different  concep- 
tual granularities,  for  example  describing  a geospatial  entity  as  a generic 
‘tree’  or  as  a ‘Pinus  roxburghii’  For  this  reason,  contributors  often  disagree 
on  the  level  of  detail  to  be  included  in  the  conceptualization.  In  princi- 
ple, infinite  knowledge  can  be  elicited  about  an  entity  from  multiple  per- 
spectives, and  the  choice  of  what  details  should  be  included  is  arbitrary, 
and  driven  by  the  desired  application.  When  a category  is  too  generic,  its 
usage  is  not  constrained  enough  and  different  conceptualizations  emerge. 
Overly  specific  categories  also  cause  problems,  as  they  often  involve  jar- 
gon, and  are  little  used.  The  production  of  VGI  oscillates  between  differ- 
ent levels  of  conceptual  granularity,  in  a balancing  exercise. 


The  promises  of  Linked  Data 

As  I have  argued  so  far,  the  online  production  of  geographic  information  faces 
substantial  semantic  challenges.  VGI  communities  rely  on  open  tagging  and  other 
lightweight  semantic  approaches  to  describe  their  data,  which  result  in  frequent 
inconsistencies,  ambiguity,  and  high  terminological  heterogeneity,  hindering  the 
re-usability  and  interpretability  of  the  data.  The  semantic  friction  encountered  in 
the  production  and  consumption  of  such  data  is  tackled  through  different  top- 
down  and  bottom-up  strategies  to  constrain  the  usage  of  terms  (e.g.  adoption  of 
existing  standards,  lexical  definitions,  etc.).  For  GIScientists,  these  issues  point 
to  exciting  research  questions.  How  can  we  design  conceptual  models  and  tech- 
nologies to  support  communication  about  the  geographic  world,  reducing  the 
gap  between  consumers  and  producers?  How  can  emergent  Web  technologies 
be  harnessed  to  help  contributors  express  their  ideas  in  a clearer,  less  ambiguous 
way  without  imposing  centralized  conceptualizations?  How  can  we  support  the 
expression  and  alignment  of  complex  local  definitions  in  intuitive  ways? 

A promising  answer  to  these  questions  lies  in  the  Linked  Data  paradigm 
(Kuhn  et  al.  2014).  Emerging  from  15  years  of  research  in  the  Semantic  Web, 
Linked  Data  proposes  to  express  information  in  an  inter-linked  data  space, 
built  on  a triple-based  formalism  that  expresses  any  data  as  subject-predicate- 
object  statements  (e.g.  Dublin  is_capital_of  Ireland , European  JJnion  is_a 
Political_entity).  The  dominant  technologies  in  this  arena  are  RDF  (a  simple 
format  to  encode  triples)  and  OWL  (a  logical  language  to  define  ontologies, 
i.e.  formal  specifications  of  conceptualizations).  The  triples  are  hosted  in  dedi- 
cated triple  stores,  which  are  able  to  index,  store,  process,  and  retrieve  triples 
more  efficiently  than  general-purpose  database  management  systems.  Unlike 
traditional  datasets,  linked  entities  must  have  unique  Web  identifiers  (URIs)  to 
enable  humans  and  machines  to  navigate  the  data  space  to  retrieve  definitions 
and  relations  with  other  entities.  To  query  the  triples,  SPARQL  and  its  spatial 
extension  Geo  SPARQL  are  currently  the  most  widespread  choice,  providing  a 
standardized  access  mechanism,  roughly  analogous  to  Web  APIs. 
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As  a toy  example,  let  us  consider  a scenario:  a tourist  wants  to  state  that  they 
took  a picture  of  the  Colosseum  in  Las  Vegas,  and  simply  tags  an  image  file 
with  the  string  ‘Colosseum’,  which  might  refer  to  a Roman  building  in  Rome, 
to  its  kitsch  replica  being  photographed,  or  to  an  obscure  board  game.  Using 
the  linked  data  approach,  existing  entities  that  match  the  string  in  the  open 
knowledge  base  DBpedia  (dbp:Colosseum,  dbp:The_Colosseum_at_Caesars_ 
Palace,  where  ‘dbp:’  stands  for  http://dbpedia.org/resource/),  can  be  suggested 
to  the  tourist,  who  can  then  select  the  appropriate  entity.  The  photo  can  now 
be  described  as  triples  (e.g.  photo_001  is_a  Photograph;  photo_001  represents 
dbp:The_Colosseum_at_Caesars_Palace),  which  can  be  stored  and  processed 
automatically,  inferring  for  example  that  the  Colosseum  is  a theatre  designed 
by  the  firm  Sceno  Plus  Inc.,  enabling  new  avenues  for  data  exploration  and 
reducing  the  potential  misinterpretation  of  the  picture. 

This  simple  idea  has  proved  fruitful  in  both  academic  and  industrial  contexts 
(Heath  & Bizer  2011).  Notably,  several  Linked  Open  Data  (LOD)  initiatives 
have  generated  an  ever- expanding  cloud  of  interconnected  datasets  containing 
billions  of  triples.2  VGI  is  a central  pillar  of  this  ‘online  commons’  of  re-usable 
open  resources,  providing  the  geographic  ground  for  the  organization  of  knowl- 
edge across  domains.  Projects  like  GeoNames,  LinkedGeoData,  and  Geo  Word- 
Net  form  a constellation  of  open  geo-knowledge  bases  (Ballatore  et  al.  2013). 
Major  corporate  actors  such  as  Google  and  Yahoo!  have  also  embraced  Linked 
Data  principles,  offering  increasingly  structured  search  products  based  on  RDF 
knowledge  bases.3  Media  groups  including  the  BBC  and  the  New  York  Times 
publish  part  of  their  informational  assets  as  Linked  Data.  Adopting  a more 
lightweight,  simpler  approach,  Microformats  promote  the  semantic  annotation 
of  people,  places,  products,  reviews,  and  organizations  in  Web  pages,  support- 
ing the  interpretation  of  content,  without  requiring  the  adoption  of  more  com- 
plex Semantic  Web  infrastructure.4 


Conclusions 

In  this  chapter,  I have  summarized  the  challenges  faced  by  VGI  from  a semantic 
perspective.  First,  I discussed  the  conceptual  difficulties  intrinsic  to  the  vague- 
ness of  many  geographic  concepts.  Second,  OpenStreetMap  (OSM)  was  taken 
as  a case  study  to  highlight  the  semantic  issues  that  cause  friction  in  the  process 
of  VGI  production  and  consumption.  VGI  contributors  coordinate  their  efforts 
and  express  information  using  a variety  of  semantic  approaches,  ranging  from 
open  tagging  to  controlled  taxonomies  and  vocabularies.  To  produce  intelligi- 
ble data,  OSM  contributors  make  choices  concerning  topology  and  mereology, 


2 http://lod-doud.net. 

3 https://googleblog.blogspot.co.uk/2012/05/introdudng-knowledge-graph-things-not.html 

4 http://microformats.org. 
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in  a tension  between  universalism  and  localism.  Particularly  in  transnational 
contexts,  the  description  of  entities  encounters  problems  of  equivalence,  result- 
ing in  contested  definitions  of  geographic  concepts,  at  different  granularities. 
To  what  degree  these  aspects  extend  to  other  VGI  projects  is  an  open  research 
question.  As  a promising  way  to  support  the  expression  of  heterogeneous  local 
knowledge  in  VGI,  I have  briefly  discussed  Linked  Data,  a technical  paradigm 
that  has  grown  from  Semantic  Web  research.  Linked  Data  aims  at  constructing 
a data  space  in  which  geographic  entities  can  be  defined  and  described  through 
standardized  and  precise  logical  mechanisms. 

While  I have  presented  reasons  to  support  the  adoption  of  Linked  Data  in 
VGI,  the  open  challenges  and  current  limitations  of  the  approach  cannot  be 
ignored.  The  logic  formalisms  used  in  Linked  Data,  such  as  OWL,  are  rather 
ill-suited  for  spatio-temporal  reasoning  and  need  substantial  extensions.  The 
triple  model,  while  conceptually  attractive,  can  be  very  verbose  to  describe  tra- 
ditional geographic  data  such  as  raster  images,  and  its  structural  complexity 
can  explode  quickly  in  realistic  scenarios.  To  explore  the  current  limitations  of 
Linked  Data,  it  suffices  to  take  a closer  look  at  the  LOD  Cloud,  whose  datasets 
vary  hugely  in  their  interpretability  and  noise,  and  whose  interlinking  is  often 
patchy  and  uneven. 

The  approach  promotes  semantic  clarity  but  cannot  enforce  it.  Without  for- 
mal constraints,  Linked  Data  can  be  as  obscure  and  ambiguous  as  plain  text: 
the  halo  of  clarity  fades  out  as  soon  as  poorly  structured  datasets  are  subject 
to  integration  and  complex  processing.  Finally,  the  tools  available  to  produce 
and  process  Linked  Data  often  lack  usability,  and  substantial  design  efforts 
are  needed  for  deployment  in  VGI  contexts,  providing  intuitive  approaches 
to  encoding  local  knowledge  and  alternative  truths.  None  of  these  issues  are 
insurmountable,  and  they  do  not  outweigh  the  enormous  potential  benefits. 
Ultimately,  the  Linked  Data  paradigm  should  be  considered  as  a promising 
technical  framework  to  mitigate  semantic  problems  in  data  production  and 
consumption,  and  not  as  an  unlikely  fix  to  ancestral  flaws  in  human  communi- 
cation that  are  here  to  stay. 
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Abstract 

The  paper  presents  empirical  research  on  the  quality  of  the  toponyms  that 
can  be  retrieved  from  OpenStreetMap  (OSM)  under  the  purpose  of  enriching 
authoritative  toponymic  databases  and  gazetteers.  An  analysis  on  the  volatility 
of  places  and  points-of-interest  (POIs)  is  presented.  We  examine  how  named 
features  behave  and  change  in  terms  of  type,  name  and  location.  The  challenge 
is  to  understand  the  behavior  and  consequently  the  fitness-for-purpose  of  OSM 
data  when  it  comes  to  a possible  use  and  integration  with  authoritative  datasets. 
We  show  that,  depending  on  the  OSM  feature  type,  the  volatility  can  vary  con- 
siderably and  we  elucidate  which  feature  types  are  consistent,  and  thus  could  be 
used  in  authoritative  gazetteers  despite  their  grassroots  nature  and  if  there  are 
spatial  patterns  behind  the  location  changes  of  features  during  their  lifespan. 
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Introduction 

Gazetteers  are  a vital  component  of  any  spatial  database  irrespectively  of  the  level 
of  detail  used  (i.e.  local,  national  or  international).  Gazetteers  consist  of  a list 
of  toponyms,  a type  and  their  corresponding  geography  This  geography  can  be 
either  a point,  a bounding  box  or  the  footprint  of  the  place.  Gazetteers  are  usually 
used  as  an  entrance  point  to  a spatial  database.  People  start  exploring  geographic 
data  by  providing  a toponym  and  search  for  features,  relations,  maps  or  events 
related  to  that  toponym.  In  other  cases,  the  outcome  of  a spatial  search  is  accom- 
panied by  toponyms  that  facilitate  the  understanding  of  the  result.  There  are  also 
cases  where  the  result  of  a search  is  the  toponym  itself  (e.g.  reverse-geocoding). 
Apart  from  these  practical  examples,  toponyms  and  gazetteers  play  a key  role 
in  many  aspects  of  everyday  life.  Examples  can  be  found  in  explicit  geographic 
applications  like  routing,  mapping  and  cartography  but  also  in  more  general 
cases  such  as  in  government,  legislation,  security  and  policing  etc.  (UN,  2006). 

However,  National  Mapping  Agencies  (NMAs),  which  are  the  de  facto  agen- 
cies responsible  for  creating  and  updating  gazetteers  in  a national  level,  are 
facing  difficulties  in  keeping  toponymic  databases  up  to  date  due  to  the  lack 
of  resources  and  due  to  the  extensive  field  work  needed  for  data  collection  and 
verification.  On  the  other  hand,  Volunteered  Geographic  Information  - VGI 
(Goodchild  2007)  can  serve  as  a promising  alternative  mechanism  for  collect- 
ing toponyms  that  could  enrich  and  update  official  gazetteers  (Goodchild  & 
Hill  2008).  In  this  context,  the  aim  of  this  paper  is  to  examine  whether  OSM 
can  provide  consistent  toponymic  datasets  or  the  grassroots  mechanisms  alter 
constantly  the  spatial  features  in  such  a level  that  hinder  their  use  in  gazetteers. 
More  specifically,  the  research  tries  to  provide  empirical  evidence  on  the  fol- 
lowing questions:  i)  What  is  the  population  and  the  types  of  OSM  features  that 
have  names  and  can  be  used  as  part  of  a gazetteer?  ii)  What  kind  of  changes 
are  taking  place  for  these  features?  iii)  What  feature  types  are  affected  and  how 
much?  iv)  Are  there  any  underlying  spatial  patterns  for  these  changes?  This 
study  adopts  the  definition  about  toponyms  that  is  provided  by  the  United 
Nations  Group  of  Experts  on  Geographical  Names  (UNGEN),  and  thus  the 
scope  of  interest  includes  populated  places,  civil  divisions,  natural  features, 
constructed  features  and  unbounded  places  or  areas  that  have  specific  local 
meaning  (UN  2006:  9). 

OSM  urges  its  contributors  to  provide  names  for  spatial  objects,  if  applicable, 
using  the  name  key  tag.  Contributors  can  add  more  than  one  name  for  spatial 
features  such  as  international  names  or  old  names  by  using  variations  of  the 
name  key  such  as  int_name  or  old_name.  Moreover,  OSM  wiki  pages  provide 
detailed  guidelines  on  how  to  correctly  assign  a name  to  spatial  objects  in  order 
to  achieve  maximum  standardization.  In  our  study  only  the  name  key  tag  has 
been  examined  of  the  point-based  objects  of  two  broad  OSM  categories:  Places1 


1 http://wiki.openstreetmap.org/wiki/Places 
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and  Points  of  Interest  (POIs)2.  These  categories  are  in  accordance  with  what 
United  Nations  (UN)  define  as  a toponym.  The  remainder  of  the  paper  is  struc- 
tured as  follows:  Section  2 presents  briefly  a selection  of  related  work  on  the 
subject.  Section  3 discusses  the  methodology  used  to  collect  and  analyze  OSM 
data.  Section  4 presents  the  results  followed  by  discussion  and  future  work  in 
Section  5. 


Related  Work 

The  importance  of  gazetteers  and  the  challenges  posed  by  the  nature  of  VGI 
data,  and  especially  of  toponyms,  has  drawn  the  interest  of  many  researchers 
and  there  is  extensive  literature  available.  Here,  we  provide  few  examples  of 
VGI  and  authoritative  data  integration  efforts  so  to  highlight  that  VGI  quality 
and  stability  is  an  important  factor  for  this  task.  Such  efforts  range  from  creat- 
ing a gazetteer  by  harvesting  volunteered  big  geo-data  from  Web  sources  (see 
for  example  Gao  et  al.  2014)  to  combining  both  administrative  and  VGI  topo- 
nyms. For  example,  Twaroch  et  al.  (2008)  use  various  web  sources  to  create 
a surface  model  of  the  toponyms’  footprints.  However,  the  authors  highlight 
the  fact  that  it  is  difficult  to  have  crisp  boundaries  when  it  comes  to  VGI  data 
and  that  there  is  a need  to  identify  outliers.  Similarly,  Kefiler  et  al.  (2009a) 
proposed  the  enrichment  of  authoritative  gazetteers  with  toponyms  extracted 
from  geotags  of  photos.  As  the  authors  support,  their  approach  could  benefit 
from  quality  indicators  of  the  geotags  used.  The  quality  of  user-contributed 
data  has  been  also  highlighted  as  a crucial  factor  in  empirical  research  with 
geo-tagged  photos  (see  for  example  Hollenstein  & Purves  2010).  Regarding 
OSM,  Hahmann  and  Burghardt  (2010)  proposed  to  link  OSM  with  GeoNames 
gazetteer  using  semantic  web  techniques  to  produce  an  enriched,  multi-lingual 
gazetteer  and  Smart  et  al.  (2010)  proposed  a methodology  for  the  conflation 
of  toponymic  data  from  multiple  sources,  including  both  authoritative  and 
VGI  datasets,  and  taking  into  account  the  quality  differences  of  each  source. 
However,  as  Mooney  and  Corcoran  (2012)  explain,  developers  of  location- 
based  services  should  be  cautious  when  it  comes  to  using  OSM  data  as  their 
research  on  frequently  edited  features  revealed  considerable  volatility  in  the 
naming  process.  Moreover,  Kefiler  et  al.  (2009b)  underline  the  importance 
that  gazetteers  should  cater  both  for  local  and  small-scale  features,  as  well  as 
timely  and  user-centric  information.  In  this  context,  OSM  has  the  potential 
to  become  a valuable  source  of  toponyms.  Thus  the  discussion  focuses  on  the 
nature,  the  behavior  and  the  evolution  of  the  toponymic  datasets  that  can  be 
retrieved  from  OSM  and  how  these  factors  affect  quality  elements  and  their 
use  in  gazetteers. 


2 http://wiki.openstreetmap.org/wiki/Points_of_interest 
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OSM  Data  Extraction 

Extending  this  line  of  research,  this  paper  focuses  on  the  volatility  of  OSM  fea- 
tures. It  goes  beyond  the  naming  changes  that  Mooney  and  Corcoran  (2012) 
focused  and  examines  also  the  location  changes  of  OSM  features.  The  area 
of  scope  of  the  research  is  Paris  region  (12.012  km2).  The  study  area  is  large 
enough  to  have  a great  diversity  of  named  features,  and  is  quite  complete  due 
to  the  large  number  of  OSM  contributors.  In  order  to  collect  the  necessary 
data,  the  Geofabrick3  shapefile  download  service  was  used.  The  datasets  for  the 
area  of  scope  were  downloaded  at  the  first  week  of  December  2014.  Shapefiles 
include  as  an  attribute  the  unique  OSM_ID  of  every  OSM  feature.  These  IDs 
were  used  in  combination  with  the  OSM  API  to  collect  and  store  in  a Post- 
greSQL/Postgis  database  all  the  versions  of  each  feature.  This  method  provided 
a complete  timeline  of  the  OSM  edits  made  in  the  area  of  scope. 


Analysis 

Descriptive  statistics 

A preliminary  analysis  on  the  availability  of  names  for  the  spatial  features 
(grouped  by  OSM  category)  was  conducted  and  the  results  are  shown  at  Table  1. 

It  can  be  seen  that,  depending  on  the  category,  there  are  considerable  vari- 
ations in  the  presence  of  names.  For  example,  in  the  ‘ Places'  category,  OSM 
contributors  have  assigned  a name  at  almost  all  (i.e.  except  from  3)  features. 
Arguably,  this  behavior  meets  the  expectations  of  an  OSM  user  (including  the 


Category 

Total 

With  names 

% 

Land  use 

36,347 

2,201 

6.1% 

Natural 

20,138 

2,093 

10.4% 

Places 

4,275 

4,272 

99.9% 

Points 

192,228 

53,052 

27.6% 

Railways 

16,482 

4,471 

27.1% 

Roads 

344,870 

152,595 

44.2% 

Waterways 

5,190 

2,520 

48.6% 

Total 

619,530 

221,204 

35.7% 

Table  1:  Total  OSM  features  and  OSM  features  with  names  for  the  study  area. 
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author  of  a gazetteer)  as  it  is  generally  expected  that  all  point  features  classi- 
fied as  ‘ Places'  should  have  a name.  In  contrast,  this  cannot  be  observed  at  the 
'Roads'  category.  Although,  in  reality,  roads  have  a name  (especially  in  urban 
areas  like  Paris)  or  a reference  name  (e.g.  link  to  Motorway  X)  the  percentage 
of  named  features  is  barely  44.2%.  Another  category  that  is  of  interest  for  a 
gazetteer  is  the  one  of  ‘Points’.  This  category  includes  a variety  of  local  features 
that  OSM  contributors  deem  as  Points  of  Interest  (POIs).  Here  the  percent- 
age of  named  features  is  just  27.6%  but,  as  it  will  be  explained  later,  this  factor 
is  not  indicative  of  the  completeness  of  the  dataset  in  terms  of  names  as  for 
many  POIs’  subcategories  a name  tag  is  not  applicable  (such  as  for  ‘crossing’  or 
‘bench’).  Given  these  results  the  research  focused  into  two  categories  that  were 
deemed  as  the  most  interesting  when  it  comes  to  examining  the  potential  to 
create  or  enrich  a gazetteer:  OSM  Places  and  OSM  POIs. 


OSM  Places 

In  terms  of  changes  in  type,  location  and  name,  a Place  point  can  either  remain 
stable  in  its  entire  life-cycle  or  undergo  a change  in  one  or  any  combination  of 
these  three  factors.  In  an  effort  to  understand  whether  the  OSM  data  can  serve 
as  a source  of  consistent  toponyms,  the  percentage  of  the  features  that  have 
been  changed  or  remained  stable  has  been  recorded  (Figure  1). 

It  can  be  seen  that  two  thirds  of  the  features  have  never  been  changed  while 
the  most  common  change  that  features  undergo  is  in  their  geographic  location. 
In  this  context,  the  next  issue  of  interest  was  to  examine  the  types  of  places  and 
their  corresponding  population  versus  the  location  movements  that  took  place 
for  each  'Place'  type.  This  classification  was  used  so  to  examine  which  types 
of  features,  have  been  moved  by  OSM  contributors.  Again,  this  can  give  an 
overview  of  the  consistency  of  OSM  Places.  The  findings  are  shown  in  Table  2. 

The  findings  show  that,  depending  on  the  type  of  place,  there  is  consider- 
able variation  in  terms  of  location  change.  For  example,  while  only  8%  of  the 
features  belonging  to  the  'locality'  type  has  been  moved,  for  the  features  that 
belong  to  the  'town'  type  this  reaches  80%.  Following  this  observation,  the  next 
step  was  to  examine  the  magnitude  of  location  change  (calculated  in  meters) 
for  each  type  of  place.  The  magnitude  of  location  change  is  considered  as  the 
distance  between  two  points:  i)  the  centroid  calculated  taking  into  account  all 
positions  of  the  feature  during  its  life  and  ii)  the  last  position  of  the  features. 
The  results  are  shown  in  Figure  2. 

This  type  of  analysis  can  visualize  the  volatility  in  location  change  of  various 
place  types.  It  can  be  observed  that  entities  with  large  spatial  extends  (either 
crisp  of  fuzzy)  suffer  from  large  changes  in  their  location  in  contrast  with 
smaller  entities.  For  example,  almost  14%  of  all  'towns'  have  moved  over  1,000 
m whereas  for  'suburbs'  21%  of  the  features  have  moved  less  than  100  m and 
65%  remained  stable  (see  also  Table  2). 


70% 
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Figure  1:  Changes  in  Places  taking  into  count  three  factors  (location,  name  and  type). 
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Type 

All  features 

Features  moved 

% 

Allotments 

2 

2 

100% 

City 

1 

1 

100% 

Hamlet 

496 

86 

17% 

Island 

9 

1 

11% 

Islet 

1 

0 

0% 

Isolated_dwelling 

37 

3 

8% 

Locality 

2,096 

167 

8% 

Neighbourhood 

214 

20 

9% 

State 

1 

1 

100% 

Suburb 

130 

45 

35% 

Town 

248 

198 

80% 

Village 

1,039 

487 

47% 

Yes 

1 

0 

0% 

Total 

4,275 

1,011 

24% 

Table  2:  Types  of  OSM  places  and  number  of  OSM  places  that  have  been  geo- 
graphically moved. 


OSMPOIs 

As  noted  above,  from  almost  200K  of  POIs  only  27.6%  of  them  (i.e.  53,052) 
had  a name  attribute.  This  is  an  expected  observation  as  there  are  types  of  POIs 
where  the  name  is  not  an  applicable  attribute.  For  example,  the  POI  types  of 
crossing , bench , traffic_signals  and  survey _points  have  in  total  75,819  spatial  fea- 
tures that  account  approximately  to  the  40%  of  the  total  population,  and  less 
than  0.04%  of  them  (i.e.  only  27  features)  have  names. 

Similar  to  the  Places’  analysis,  the  changes  of  the  same  elements  (i.e.  of  loca- 
tion, type  and  name)  have  been  examined  also  for  the  POIs.  The  findings  show 
that  about  60%  of  features  have  not  been  changed  since  their  creation  while  the 
most  common  change  this  time  is  the  change  in  their  name. 

In  order  to  examine  which  POI  types  are  the  most  volatile  in  terms  of  name 
and  location  change,  a scatter-plot  (Figure  4)  has  been  created.  The  x-axis  in 
Figure  4,  shows  the  percentage  of  features  that  had  a change  in  name  for  the 
30  most  populous  OSM  types.  Name  changes  range  from  minor  changes  (e.g. 
alterations  in  capital  letters  or  blank  spaces)  up  to  changes  in  the  entire  name. 
Although  it  is  not  clear  which  OSM  feature  types  should  be  included  in  a gaz- 
etteer (see  also  discussion  in  Section  5),  Figure  4 shows  that  there  are  types  of 
POIs  that  have  a large  rate  of  name  changes  and  others  that  remain  relatively 
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Figure  2:  The  magnitude  of  location  change  (x-axis,  in  m)  per  type  of  OSM  ‘Place’ 
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Figure  4:  Percentage  of  POIs  that  had  a name  (x-axis)  and  a location  (y-axis)  change. 
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Figure  5:  Box- Whisker  graph  of  the  location  change  for  each  POI  type. 


1 08  European  Handbook  of  Crowdsourced  Geographic  Information 


stable  (e.g.  lower  left  corner).  The  y-axis  in  Figure  4,  shows  the  % of  features 
that  had  a location  change  for  each  OSM  type.  Here,  again,  it  can  be  observed 
that  not  all  POI  types  behave  the  same;  certain  POI  types  suffer  more  than 
others. 

The  combined  view  of  changes  over  these  two  factors  indicates  which  of 
these  types  (if  they  are  to  be  included  in  a gazetteer)  might  be  considered  as  too 
unstable  to  populate  a gazetteer.  However  the  counter  argument  can  be  that  the 
seemingly  stable  behavior  of  certain  POI  types  can  be  explained  by  poor  con- 
tributors attention  and  thus  this  stability  might  indicate  obsolete  or  out-of-date 
features.  In  any  case,  Figure  4 raises  awareness  of  the  futures  behavior  and  gives 
a better  insight  on  what  kind  of  volatility  should  be  expected  per  POI  type. 

After  gaining  a better  understanding  on  which  POI  types  are  volatile  in  terms  of 
name  and  location  change,  the  next  step  was  to  quantify  the  latter.  In  order  to  bet- 
ter visualize  the  position  change,  a Box- Whisker  plot  has  been  created  in  Figure  5 
(note  that  the  upper  quartile  is  not  marked  as  outliers  in  many  types  make  it  draw 
out  of  scale  - for  ease  of  understanding  the  mean  value  has  been  added). 

First,  this  type  of  analysis  can  help  to  understand  which  spatial  features 
should  not  be  modeled  as  POIs  since  the  simple  geometry  of  a point  appears 
not  to  be  the  best  way  to  model  this  physical  entity.  For  example,  motorway 
junctions  seem  not  to  gather  consensus  among  OSM  contributors  regarding 
the  position  of  the  POI  as  the  average  location  change  is  more  than  30  m.  On 
the  contrary,  there  are  POI  types  that  despite  their  location  change,  the  distance 
between  various  locations  remains  well  under  10  m (i.e.  an  arbitrary  positional 
accuracy  threshold  of  hand-held  GPS  devices).  Second,  this  type  of  analysis  can 
highlight  gross  errors  and  outliers  in  OSM  datasets  that  might  downgrade  the 
overall  spatial  quality  of  a dataset.  The  largest  the  distance  between  the  mean 
and  the  upper  level  of  the  second  Quartile  box  (i.e.  the  50%  of  the  features),  the 
more  outliers  and  gross  positional  errors  exist  in  each  category.  For  example, 
8%  of  the  features  for  the  fuel  category  have  been  moved  more  than  100  m 
(with  a recorded  maximum  movement  of  659  m).  Finally,  it  is  made  clear  that 
for  many  POI  types  a clearer  feature  extraction  guide  is  needed.  For  example, 
when  capturing  schools  or  station  it  needs  to  be  clear  for  contributors  where 
the  point  should  be  positioned:  at  the  entrance  of  the  building,  at  the  centroid 
of  the  main  building  or  somewhere  else.  Let  us  mention  that  although  there  are 
instructions  in  the  OSM  wiki  pages  how  to  map  each  feature,  apparently  these 
instructions  are  not  explicit  enough  and  thus  inconsistencies  occur. 


Spatial  patterns 

Finally,  for  the  entire  dataset  of  POIs  a hot-spot  analysis  was  calculated  based 
on  the  location  change  for  each  feature.  A visualization  based  on  Z-score  is 
shown  in  Figure  6.  Hot-spot  analysis  (using  the  Getis-Ord  Gi  statistic  provided 
in  ESRI  ArcGIS  10.2.2)  can  reveal  whether  a phenomenon  is  random  or  not. 
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Here  it  can  be  observed  that  there  are  concentrated  hot  colors  areas  (i.e.  areas 
with  not  random  large  movements)  and  cold  colors  areas  (i.e.  areas  with  not 
random  small  movements)  for  POIs.  While  a first  observation  can  be  made 
that  hot  colors  areas  appear  in  the  popular  and  touristic  area  of  Montmartre 
(north  of  map)  and  the  cold  colors  areas  appear  at  the  periphery  of  Paris  (more 
residential  than  touristic),  further  analysis  is  needed  to  fully  understand  the 
causes  of  the  phenomenon.  For  instance,  Montmartre  is  a hill,  and  the  sources, 
like  the  ortho-rectified  satellite  imagery,  used  for  the  positioning  of  the  POIs 
may  be  less  accurate  there. 


Discussion  and  future  work 

VGI  datasets  are  a dynamic  source  of  spatial  information.  In  particular,  OSM 
datasets,  which  usually  function  as  a proxy  in  the  research  on  VGI  data,  have 
drawn  the  interest  of  researchers  regarding  their  use  in  helping  NMAs  to  com- 
plete or  update  existing  geospatial  products  or  even  to  create  new  ones  (Anto- 
niou  2011).  Improved  and  enriched  authoritative  products  can  be  toponymic 
databases  and  gazetteers.  The  importance  of  gazetteers  in  acquiring  accurate 
results  in  spatial  searches  is  paramount  and  thus  the  update  of  official  gazet- 
teers with  local  knowledge  should  be  made  with  caution  and  meticulous  exam- 
ination of  the  VGI  quality.  Unnecessary  changes  in  the  names,  types  or  the 
geographic  position  (no  matter  how  subtle  or  small)  can  introduce  problems  to 
authoritative  products  or  location  based  services.  However,  once  successful,  the 
presence  of  local  and  community-level  named  features  and  landmarks  can  con- 
siderably enrich  and  improve  gazetteers  and  geospatial  services.  A first  point 
of  consideration  is  the  decision  on  which  types  of  user-contributed  features 
should  be  used  in  a gazetteer.  For  example,  certain  types  of  POIs  are  possible 
to  serve  as  landmarks  that  can  help  to  provide  eloquent  and  easily  understand- 
able routing  directions.  Although  this  paper  does  not  delve  into  the  subject  of 
feature  type  importance,  it  provides  evidence  that  the  selection  of  OSM  types 
and  features  should  be  examined  from  a quality  point  of  view  as  well. 

What  this  paper  has  examined  is  the  behavior  and  thus  the  fitness-for-purpose 
of  OSM  data  as  a source  of  toponymic  data.  The  aim  was  to  examine  whether  the 
OSM  datasets  are  a consistent  datasets  or  the  grassroots  mechanisms  alter  con- 
stantly the  datasets  in  such  a level  that  in  practice  hinder  the  use  of  OSM  data. 
The  findings  show  that  VGI  and  authoritative  data  conflation  is  not  a straight- 
forward process  as  they  differ  considerably  in  nature.  While  authoritative  topo- 
nyms  are  largely  static  and  hard  to  change  spatial  entities,  a considerable  per- 
centage of  VGI  toponyms  undergo  changes.  Not  all  OSM  types  are  fit  to  support 
the  enhancement  of  administrative  gazetteers  as  the  OSM  specification  and  con- 
tribution practices  might  generate  an  unwanted  volatility  in  the  data.  This  obser- 
vation generates  a number  of  questions  that  could  be  the  aim  of  future  work. 
First,  it  is  important  to  understand  the  nature  of  these  changes.  For  example,  do 
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changes  in  location  serve  a better  mapping  outcome,  refer  to  previous  mistakes 
and  thus  are  a spatial  quality  improvements  or  are  they  simply  real-life  move- 
ments that  OSM  contributors  capture?  Relating  movement  to  the  geographic 
extent  of  the  named  feature,  or  to  some  contributor  pattern  would  be  useful 
to  understand  how  and  why  the  changes  occur.  Using  what  Goodchild  and  Li 
(2008)  call  the  geographic  approach  to  assess  named  features  movement  would 
also  be  useful:  e.g.  check  whether  a Place  feature  that  refers  to  a town  has  been 
move  to  the  centroid  of  the  town  hall.  Similarly,  it  could  be  examined  if  there  are 
any  time  patterns  in  the  changes.  For  example  are  these  changes  concentrated  at 
the  early  period  of  the  creation  of  a feature  and  thus  it  is  an  indicator  of  quality 
improvement  (as  discussed  in  Haklay  et  at.  2010)  or  are  they  happening  during 
the  entire  life-cycle  of  each  feature  and  indicate  an  endogenous  volatility  of  the 
spatial  feature?  Nevertheless,  contributors  might  alter  OSM  features  (no  matter 
what  the  reason)  and  this  change  can  either  be  very  small  and  thus  authorita- 
tive products  and  services  that  have  integrated  OSM  data  will  not  be  affected  or 
might  be  large  enough  to  introduce  unwanted  volatility.  Finally,  it  is  of  interest 
to  compare,  in  terms  of  completeness,  the  OSM  toponyms  with  authoritative 
data  so  to  understand  at  what  extend  VGI  data  can  help  NMAs  to  improve  their 
gazetteers.  Thus,  future  work  will  include  the  comparison  between  OSM  and 
authoritative  toponyms  (provided  by  IGN  France,  the  French  NMA). 
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Abstract 

With  the  increasing  importance  of  VGI  for  GIScience,  data  quality  becomes 
an  issue  of  high  concern.  Particularly  in  collaborative  mapping  projects,  when 
a group  of  public  participants  acts  to  collect,  update  and  share  information 
about  geographic  features,  aiming  to  maintain  and  improve  a geo-spatial  data- 
set. OpenStreetMap  (OSM)  is  the  most  common  VGI  project  that  aims  to 
develop  free  world  digital  map.  Although  several  studies  emphasized  the  posi- 
tional accuracy  and  completeness  of  the  OSM  data,  particularly  in  the  urban 
areas,  they  also  highlighted  its  problematic  thematic  accuracy.  In  this  chapter, 
we  handle  the  thematic  accuracy  quality  measure  from  the  facet  of  classifica- 
tion. This  chapter  presents  an  approach  for  rule-guided  classification  for  VGI 
projects.  The  proposed  approach  exploits  the  availability  of  data  to  learn  the 
distinct  characteristics  of  a set  of  geographic  features.  Afterwards,  the  learned 
characteristics  are  used  to  guide  the  contributors  toward  the  most  appropriate 
data  classes,  aiming  to  improve  the  data  quality.  The  approach  consists  of  two 
phases:  Learning  and  Guiding  phases.  During  the  Learning  phase,  data  mining 
algorithms  are  applied  to  learn  the  geographic  characteristics  of  specific  fea- 
tures. The  learning  process  results  in  a set  of  rules  describing  these  features.  The 
extracted  rules  are  used  to  develop  a classifier.  Afterwards,  during  the  Guiding 
phase,  the  developed  classifier  is  used  for  several  purposes;  1)  acts  to  detect 
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problematic  classified  entities;  and  2)  guides  and  aids  the  contributors  during 
the  classification  process.  An  empirical  study  followed  by  an  implementation 
is  conducted.  The  results  show  the  feasibility  of  the  proposed  approach  and 
highlight  some  limitations  that  could  be  improved  in  the  future  studies.  The 
developed  tool  generates  promising  results  and  improves  the  classification  of 
OSM  dataset  as  well. 


Keywords 

Volunteered  Geographic  Information  (VGI),  Spatial  Data  Quality,  Thematic 
accuracy,  Spatial  data  mining. 


Introduction 

Crowd- sourcing,  the  advance  of  web  technologies  and  the  availability  of  loca- 
tion sensing  devices  empower  the  public  to  produce  contents  associated  with 
implicit  or  explicit  spatial  references.  This  form  of  User  Generated  Contents 
(UGC)  has  been  known  as  Volunteered  Geographic  Information , in  which  a 
group  of  people  voluntary  acts  to  collect,  update,  and  share  spatial  information 
(Goodchild  2007).  VGI  changes  the  conventional  way  of  mapping  activities 
resulting  in  collaborative  mapping.  Those  activities  were  exclusively  reserved  - 
for  a long  time  - for  mapping  agencies  and  specialized  organizations.  However, 
in  collaborative  mapping,  participants  are  eager  to  collect  information  about 
geographic  features  producing  maps  (Gillavry  2004).  Among  others,  Open- 
StreetMap1  (OSM),  Wikimapia2  and  Google  MapMaker3  are  examples  of  col- 
laborative mapping  projects.  OSM  is  the  most  prominent  example  of  a VGI 
project;  it  aims  to  develop  a free  digital  map  of  the  world  editable  and  available 
to  everyone.  During  the  last  decade,  several  applications  and  services  have  been 
developed  based  on  VGI  data  including  - but  not  limited  to  - urban  planning, 
environmental  monitoring,  crises  management,  map  provision,  etc. 

Despite  the  increasing  utilization  of  VGI  data,  its  questionable  quality  still 
makes  it  - in  some  cases  - of  limited  use  (Elwood  et  al.  2012;  Flanagin  & Metzger 
2008).  Among  other  reasons,  contributors  diversities  and  the  fixable  contribu- 
tion mechanisms  are  resulting  in  data  of  heterogeneous  quality  (Mooney  & 
Corcoran,  2012).  Several  studies  assess  VGI  data  by  comparison  with  authori- 
tative data  sources.  They  conclude  the  promising  completeness  and  positional 
accuracy  of  OSM  data,  particularly  in  urban  areas  (Haklay  2010;  Neis  et  al. 
2011).  In  Hecht  and  Stephens  (2014),  the  authors  highlight  the  declining  of 


1 http://www.openstreetmap.org/ 

2 http://www.wikimapia.org/ 

3 http://http://www.google.com/mapmaker/ 
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data  quality  with  the  increasing  distance  form  urban  areas.  Regarding  particu- 
lar features,  Girres  and  Touya  (2010),  Haklay  and  Weber  (2008)  and  Ludwig 
et  al.  (2011)  emphasize  the  quality  of  street  networks  in  France,  the  UK  and 
Germany  respectively.  Whereas  Arsanjani  et  al.  (2015)  and  Arsanjani  and  Vaz 
(2015)  address  the  promising  contributions  of  land  use/land  cover  features  in 
OSM  datasets.  The  studies  highlight  the  heterogeneous  data  quality  not  only 
regarding  the  positional  accuracy  and  completeness  quality  measures,  but  also 
regarding  the  problematic  thematic  accuracy  of  data  (Haklay  2010;  Neis  et  al. 
2011;  Vandecasteele  & Devillers  2013).  Thematic  accuracy  implies  correctness 
of  the  assigned  classification  to  a given  entity  with  that  entity’s  characteristics 
and  its  geographic  context.  Hence,  this  chapter  tackles  VGI  data  quality  from 
the  classification  perspective. 

In  OSM  projects,  the  loose  classification  mechanisms  lead  to  inappropriate 
classification  of  data.  Whether  a piece  of  land  covered  by  grass  is  classified  as 
park , meadow  or  forest,  if  a water  body  is  classified  as  pond  or  lake , whether  an 
area  is  classified  from  the  land  use  perspective  as  residential  or  industrial , etc. 
All  these  classifications  mainly  depend  on  contributors’  perception  ( subjective 
classification). 

Otherwise,  the  appropriate  classification  should  reflect  the  inherent  geo- 
graphic characteristics  of  an  entity  ( objective  classification).  For  example,  park 
and  garden  are  likely  used  for  entertainment  and  should  contain  amusement 
facilities  like  a playground,  sport  area,  etc.  and  a lake  is  likely  surrounded  by  a 
natural  landscape  and  some  facilities  like  tracks  or  benches,  and  is  larger  in  size 
than  a pond ; whereas  residential  areas  mostly  cover  residential  buildings  and 
likely  contain  some  residential  services,  whereas  industrial  areas  usually  have 
industrial  properties  like  a company,  factory,  etc. 

In  this  chapter,  we  propose  a rule-guided  classification  approach.  The 
approach  aims  to  improve  the  data  classification;  it  works  to  develop  a recom- 
mendation system  able  to  guide  the  contributors  towards  appropriate  classifica- 
tion. The  approach  works  to  extract  the  distinct  geographic  characteristics  of 
a specific  feature  and  encode  them  in  the  form  of  rules.  The  rules  are  encoded 
together  into  a classifier.  Afterwards,  the  developed  classifier  is  applied  to  guide 
the  contributors  towards  appropriate  classifications. 

As  an  empirical  study,  we  address  the  classification  of  some  grass-related  fea- 
tures; where  a piece  of  land  covered  by  grass  could  be  classified  as  forest,  garden, 
grass , meadow  or  park.  The  classification  of  these  features  generates  a challenge; 
they  are  commonly  covered  by  grass,  however  each  class  has  its  distinct  char- 
acteristics. For  example,  the  park  and  garden  classes  have  entertainment  char- 
acteristics, th e forest  class  are  usually  covered  with  trees  or  other  woody  vegeta- 
tion and  the  meadow  class  has  agricultural  characteristics,  etc.  The  findings  are 
promising  and  show  the  feasibility  of  the  approach. 

This  chapter  is  organized  as  follows:  the  2nd  section  gives  insights  into  the 
classification  challenges  in  an  OSM  project,  while  the  3rd  section  presents 
the  proposed  approach  of  guided  classification  for  VGI.  An  empirical  study 
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is  presented  in  the  4th  section.  The  last  section  summarizes  the  findings  and 
points  out  the  future  research  directions. 


Classification  Challenges  at  OpenStreetMap 

The  OSM  project  is  the  most  prominent  collaborative  mapping  project:  it  cov- 
ers most  of  the  world,  has  more  than  2 million  registered  users  on  October 
20 154  and  the  OSM  data  is  utilized  in  various  services  and  applications.  How- 
ever, its  problematic  classifications  make  its  data  of  limited  use  (Devillers  et  al. 
2010).  In  particular,  the  problematic  classification  results  in  inaccurate  results 
and/or  incomplete  answers.  The  uncertainty,  poor  definitions  and  various  indi- 
vidual conceptualizations  of  geographic  features  are  other  reasons  behind  the 
problematic  classification  of  data  (Fisher  1999;  Grira  et  al.  2010).  However, 
regarding  the  OSM  project,  the  problematic  data  classification  might  come 
back  to  the  following: 

• Contributors’  heterogeneity:  the  project  harnesses  the  contributors 
diversities  to  produce  rich  datasets.  However,  these  diversities  influence 
the  resulting  data  quality  (Coleman  et  al.  2009);  contributors  have  various 
geographic  and  cartographic  knowledge;  this  fact  results  in  heterogeneous 
perceptions  of  the  geographic  features  and  consequently  problematic  clas- 
sifications; what  is  perceived  by  a contributor  as  a park  could  be  considered 
by  another  as  a grass  or  garden  type  area. 

• Contribution  methodologies:  OSM  supports  the  contributors  hetero- 
geneity by  providing  different  methods  of  contribution.  The  most  popu- 
lar contribution  methods  are  either  by  uploading  GPS  tracks  directly  or 
by  editing  geographic  features  over  satellite  images.  The  later  method  is 
the  most  common  and  is  known  as  remote  contribution.  Figure  1 illus- 
trates, a remote  contribution  (armchair  contribution)  process,  in  which  a 
contributor  uses  an  editor  (e.g.  JOSM)  to  contribute  information  about  a 
specific  feature  by  tracking  the  feature  on  satellite  images.  The  contribu- 
tion method  itself  generates  a challenge  during  the  classification  process 
(Mooney  & Corcoran  2012).  For  example,  in  Figure  1,  the  pieces  of  grass- 
covered  land  look  similar  and  their  classifications,  whether  park,  garden, 
grass  or  meadow,  mainly  depend  on  the  contributors  perception  and  need 
some  sense  of  locality.  Moreover,  the  loose  tagging  mechanism  of  OSM  also 
results  in  problematic  classifications.  There  is  no  restriction  on  the  num- 
ber of  tags  associated  with  a certain  entity;  an  entity  could  be  associated 
with  no  tags  or  several  tags  with  endless  combinations  without  any  integrity 
checking  mechanism.  For  example,  an  entity  could  plausibly  be  tagged  with 
leisure-park,  natural-grass,  landuse-meadow  and  place-garden. 


4 http://osmstats.neis-one.org/. 
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Figure  1:  Remote  contribution  using  JSOM  editor. 


• Ambiguous  recommendations:  the  OSM  project  provides  only  recom- 
mendations for  contributors  through  its  Wiki5  pages.  These  recommenda- 
tions resulted  from  discussions  between  mappers.  However,  it  is  probable 
that  most  of  the  contributors  do  not  spend  enough  time  checking  these 
recommendations.  Furthermore,  due  to  ambiguous  terminologies  (e.g. 
wood  or  forest,  landuse  or  landcover , etc.),  some  recommendations  might 
be  conceptually  misinterpreted,  particularly  by  non-experts.  For  example, 
the  unclear  distinction  between  lake  and  pond  classes  results  in  a new  class 
of  lake;  pond. 

The  previous  points  summarize  the  fundamental  reasons  behind  the  problem- 
atic data  classification  of  OSM.  These  challenges  come  up  due  to  the  nature 
of  VGI  and  the  OSM  project  in  particular.  There  exist  other  reasons  due  to 
the  nature  of  geographic  data  as  well.  Most  geographic  features  are  not  well 
defined;  the  fact  that  results  in  crisp  boundaries  between  classes.  In  some 
cases,  an  identical  feature  could  plausibly  belong  to  multiple  classes.  However, 
small  details  usually  exist  and  distinguish  between  conceptually  overlapping 
classes.  In  the  case  of  remote  contribution,  these  details  are  hardly  recognized 


5 http://wiki.openstreetmap.org/wiki/Map_Features 
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by  armchair  contributors,  and  consequently,  they  contribute  either  imprecise 
or  incomplete  data. 


Rule-Based  Guided  Classification  Approach 

To  tackle  the  classification  challenges,  we  propose  a rule-based  guided  classi- 
fication approach.  Through  guiding  and  recommendations,  the  approach  aims 
to  produce  data  with  appropriate  and  consistent  classifications.  The  approach 
consists  of  two  phases:  Leaning  and  Guiding  phases. 


Learning  Phase 

During  the  Learning  phase,  the  approach  employs  the  increasing  availability 
of  OSM  data  in  learning  the  characteristics  that  distinguish  between  simi- 
lar classes.  Figure  2 shows  a summary  of  the  learning  phase.  In  this  phase, 
the  task  is  to  develop  a classifier  able  to  distinguish  between  related  classes. 
Data  mining  algorithms  are  used  to  find  the  distinct  topological  characteris- 
tics that  distinguish  between  classes.  The  extracted  characteristics  have  the 
form  of  predictive  rules.  Afterwards,  the  rules  are  integrated  into  a classifier. 
During  the  mining  process,  we  depend  on  qualitative  spatial  analysis  to  find 
the  characteristics  of  a specific  class.  Topology,  direction  and  distance  are  the 
common  qualitative  spatial  relations.  In  this  work,  we  particularly  investigate 
the  topological  relations  to  understand  the  geographic  context  of  the  given 
classes. 

Topological  Analysis  Based  on  the  first  law  of  geography  (Tobler  1970), 
nearby  geographic  features  are  related  to  each  other.  For  example,  the  existence 
of  sport’s  areas  and  playgrounds  inside  the  park  and  garden  features,  the  loca- 
tion of  gas  stations  in  a direct  access  to  roads  area,  etc.  Hence,  in  this  work  we 
investigate  the  topological  relations  between  pairs  of  entities  to  find  the  charac- 
teristics that  identify  each  class.  Each  entity  is  characterized  by  its  interior  and 
exterior  context.  At  the  same  time,  the  appropriate  classification  should  reflect 
the  entities  characteristics  and  matches  its  geographic  context. 


Figure  2:  Learning  phase  of  the  proposed  guided  classification  approach. 
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We  utilize  the  9-Intersection  Model  (9IM)  (Egenhofer  1995),  in  which  the 
topological  relations  between  pairs  of  entities  are  defined  as  follows:  disjoint , 
meet,  overlap,  covers,  coveredBy,  contains,  inside  and  equal.  In  this  work,  disjoin, 
meet,  overlap,  contains  and  coverdBy  relations  are  considered.  While  inside,  cov- 
ers and  equals  relations  are  neglected  due  to  two  reasons:  (a)  inside  and  cov- 
ers are  the  inverse  of  contains  and  coveredBy  relations  respectively;  and  (b)  the 
equal  relation  rarely  occurs  and  does  not  add  information  to  this  analysis. 

Mining  Process  The  topological  analysis  aims  to  find  frequent  patterns  (top- 
ological relations)  involved  between  target  classes  and  other  geographic  fea- 
tures, e.g.  park  contains  playground,  sport  center,  etc.  Each  tag  is  considered  as 
a new  feature.  For  example,  leisure  = playground  and  leisure  = sport  are  treated 
differently.  We  encode  them  as  leisure  playground  and  leisure  sport  respec- 
tively and  associate  each  one  with  a unique  identifier  (ID),  to  facilitate  the  min- 
ing process.  The  processing  is  computationally  exhaustive  and  should  be  done 
in  advance  during  the  preparation  for  the  mining  task.  Afterwards,  the  mining 
process  works  to  extract  atomic  rules  in  the  following  form: 

Class  (E,  C)  <—  R(E,  F)  (1) 

where  E represents  an  entity,  C e {c park ’,  ‘meadow’,  etc.},  R is  one  of  the  topo- 
logical relations  where  R e {‘contains’,  ‘meet’,  etc.}  and  F represents  the  set  of 
frequent  features  that  mostly  involved  in  a relation  R with  entities  of  C. 

We  apply  the  Apriori  algorithm  (Agrawal  et  al.  1994)  to  extract  the  rules.  In 
particular,  we  use  the  class  association  rules  mining  task,  when  rules  have  a pre- 
defined class  (e.g.  ‘park’)  as  their  outcome.  Appropriate  constraint  parameters 
like  support  and  confidence  should  be  adjusted  to  extract  and  filter  the  interest- 
ing patterns.  Afterwards,  the  extracted  rules  are  integrated  into  a classifier. 

Basically,  developing  a classifier  based  on  a set  of  predictive  rules  consists 
of  the  following  steps:  (1)  find  all  the  interesting  class  association  rules  from  a 
dataset;  (2)  filter  the  extracted  rules  into  a set  of  predictive  association  rules;  (3) 
encode  the  rules  into  a classifier;  and  (4)  evaluate  the  classifier  on  a test  dataset. 


Guiding  Phase 

During  the  Guiding  phase,  the  developed  classifier  could  be  used  in  many  dif- 
ferent scenarios.  In  this  approach,  we  present  three  scenarios:  the  contribut- 
ing, checking  and  enriching  scenarios.  Figure  3 gives  brief  illustrations  of  these 
scenarios  as  follows: 

Contributing  Scenario  In  this  scenario,  the  classifier  is  embedded  into  an 
editing  tool.  At  the  contribution  time,  the  classifier  checks  the  validity  of 
a given  classification.  In  case  of  a problematic  classification,  the  classifier 
informs  the  contributor  of  some  recommendations.  Then,  according  to 
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Figure  3:  Guiding  phase  of  the  proposed  guided  classification  approach. 


the  given  recommendations,  the  contributor  reacts  with  a correction  (if 
required).  The  real  challenge  in  this  scenario  is  the  computational  com- 
plexity. During  our  study,  the  entities  are  processed  in  advance  to  inves- 
tigate their  geographic  context.  In  contract,  this  scenario  requires  on-line 
processing  of  contributions  at  the  time  of  editing. 

Checking  Scenario  This  scenario  could  be  directly  applied  when  the  devel- 
oped classifier  is  applied  to  an  existing  dataset  generating  the  potential 
problematically  classified  entities.  Afterwards,  the  outliers  are  presented  - 
associated  with  recommendations  - to  crowd-sourcing  revision.  The  con- 
tributors act  to  correct  the  problematic  entities  (if  required). 

Enriching  Scenario  In  which  the  classifier  is  applied  to  a set  of  unclassified 
entities.  The  classifier  predicts  classifications  for  these  entities  and  pre- 
sents them  for  crowd-sourcing  confirmation.  The  contributors’  role  here 
is  to  confirm  the  given  classification  and  make  corrections  (if  required). 
Another  enrichment  scenario  could  also  be  achieved,  when  the  contribu- 
tor reacts  to  add  more  information  to  satisfy  the  given  recommendations. 

In  all  of  the  proposed  scenarios,  the  classifier  cannot  do  automatic  classifica- 
tion or  automatic  correction  directly  on  the  data  source.  However,  it  provides 
recommendations  for  directing  contributors  towards  data  of  appropriate  clas- 
sification. At  the  same  time,  developing  a global  classifier  might  also  be  inaccu- 
rate. Therefore,  the  proposed  approach  is  to  maintain  locality  during  both  the 
Learning  and  Guiding  phases.  We  assumed  that  a geographic  feature  should  be 
classified  identically,  at  least  on  the  country  level. 
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Empirical  Study:  Grass  & Green 

To  evaluate  the  proposed  approach,  an  empirical  study  and  an  implementation 
are  conducted.  The  study  aims  to  develop  a classifier  to  distinguish  between 
grass-related  classes : forest,  garden,  grass,  meadow  and  park  classes.  The  classes 
are  the  most  common  grass-related  classes  within  the  boundaries  of  urban  cit- 
ies (the  geographic  scope  of  the  research).  The  classification  of  these  features 
represents  a challenge  due  to  the  following:  (1)  in  satellite  images,  they  appear 
similar  as  a green  area;  (2)  in  some  cases,  a feature  could  plausibly  belong  to 
multiple  classes  (e.g.  park  and  garden );  and  (3)  for  non-experts,  they  are  all 
grass.  Thus,  a contributor  might  be  unfamiliar  with  the  characteristics  that  dis- 
tinguish be-  tween  classes.  Table  1 shows  the  OSM  Wiki  recommendations  for 
these  classes.  The  given  recommendations  are  based  on  discussion  between 
mapper  communities.  The  given  recommendations  at  Table  1 indicate  that 
there  exist  unique  characteristics  that  distinguish  between  classes. 


Data  Processing 

We  use  an  OSM  dataset  from  Germany  dated  to  December  2013.  The  choice 
of  Germany  comes  from  the  following  reasons:  i)  the  existence  of  a large  group 
of  active  mappers;  and  2)  several  researchers  have  emphasized  the  quality  of 


Class 

Recommendations 

forest 

Some  use  this  tag  for  land  primarily  managed  for  timber  production, 
others  use  it  for  woodland  that  is  in  some  way  maintained  by  humans. 

garden 

A distinguishable  planned  space,  usually  outdoors,  set  aside  for  the 
display,  cultivation  and  enjoyment  of  plants  and  other  forms  of  nature. 
It  incorporates  both  natural  and  man-made  materials.  The  most  com- 
mon form  is  known  as  a residential  garden,  it  is  a form  of  garden  and 
is  generally  found  in  proximity  to  a residence,  such  as  the  front  or 
back  garden.  Residential  gardens  are  usually  of  human  scale,  as  they 
are  most  often  intended  for  private  use. 

grass 

A tag  for  a smaller  areas  of  mown  and  managed  grass,  for  example 
in  the  middle  of  a roundabout  or  verges  beside  a road.  Should  not  be 
used  where  a more  specific  tag  is  available. 

meadow 

Used  to  tag  an  area  of  meadow,  which  is  an  area  of  land  primarily 
vegetated  by  grass  plus  other  non-woody  plants. 

park 

An  area  of  open  space  provided  for  recreational  use,  usually  designed 
and  in  a semi-natural  state  with  grassy  areas,  trees  and  bushes.  Parks 
are  often  but  not  always  municipal. 

Table  1:  OSM  recommendations  for  the  target  classes. 
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the  data.  We  extract  the  entities  from  the  10  most  densely  populated  cities6  to 
ensure  active  mappers  and  hence  a certain  level  of  quality  The  dataset  consists 
of  3,724  forest,  3,030  garden,  7,336  grass,  4,277  meadow  and  4,445  park  entities. 
About  50%  of  the  extracted  entities  have  only  one  version  (edits),  which  indi- 
cates the  lower  attraction  of  these  entities  to  mappers.  According  to  Mooney  & 
Corcoran  (2012),  an  increasing  number  of  edits  does  not  usually  imply  high 
quality.  However,  it  reflects  the  heavy  collaboration/competition  among  con- 
tributors to  improve  the  data  quality.  The  extracted  entities  are  processed  indi- 
vidually by  checking  the  topological  relations  between  each  entity  and  other 
entities  nearby. 


Learning  Process 

During  the  learning  process,  the  objective  is  to  develop  atomic  rules  per  class 
per  topological  relation.  Due  to  the  uncertainty  of  spatial  context,  we  take  into 
account  that  everything  is  possible.  Thus,  a 1%  support  threshold  is  considered 
sufficient  to  extract  the  interesting  patterns  (frequent  topological  relations). 
Each  topological  relation  is  processed  individually  with  a given  class  producing 
a set  of  predictive  rules  of  that  class.  We  extracted  8,504  rules:  4,100  describe 
forest,  215  describe  garden,  745  describe  grass,  506  describe  meadow  and  2,938 
describe  park.  Although  a large  number  of  rules  have  a confidence  threshold 
greater  than  50%,  the  rules  themselves  represent  some  difficulties  in  the  clas- 
sification process  due  to:  (1)  they  have  a wide  range  of  confidence  threshold 
from  100%  to  0.7  %;  (2)  due  to  the  similarity  between  some  classes,  there  exist 
duplicated  rules  pointing  to  different  classes;  and  (3)  regarding  the  topological 
relations,  some  relations  have  higher  confidence  thresholds  than  the  others. 


Classification  Process 

During  the  classification  process,  each  entity  is  checked  against  all  extracted 
rules.  For  example,  Figure  4 shows  an  entity7  with  a meadow  classification 
which  has  osm_id  = 96279661.  The  entity  matches  46  rules:  26  park,  6 meadow, 
5 forest,  5 garden  and  4 grass.  Table  2 presents  a sample  of  the  matched  rules 
for  this  entity.  The  figure  illustrates  that  the  entity  contains  a playground,  sport 
areas  and  planned  footways,  which  reflect  the  characteristics  of  the  park  class. 

According  to  Table  2,  considering  the  maximum  confidence  of  the  matched 
rules,  the  top  20  rules  have  confidence  thresholds  ranging  form  92%  to  80%  and 
all  of  them  have  the  result  Class(E,  ‘ park ’).  At  the  same  time,  when  consider- 
ing the  maximum  confidence  per  class,  this  entity  matches  with  park,  meadow, 


6 http://www.citymayors.com/ gratis/ german_topcities.html 

7 http://www.openstreetmap.org/way/96279661,  last  accessed  April  2015 
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Figure  4:  A entity  with  osm_id=96279661  classified  as  meadow’  (last  visit  at 
April  2015). 


Rule  — Confidence 

Class  (E,  ‘park’)  <—  contains  (£,  [1,22,156]))  - 92% 
Class  (£,  ‘park’)  <—  contains  (£,  [1,15,22,  156]))  - 91% 
Class  (£,  ‘park’)  <—  contains  (£,  [15,21]))  - 89% 

Class  (£,  ^par/c5)  <—  contains  (£,  [1,15]))  - 88% 


Class  (E,  ‘park’)  <—  contains  (£,  [22]))  - 76% 

Class  (£,  ^par/c5)  contains  (£,  [15]))  - 66% 

Class  (£,  ‘meadow’)  <—  containsBy  (£,  [128]))  - 46% 

Class  (E,  ‘park’)  <—  meet  (£,  [15]))  - 34% 

Where,  l=leisure_playground,  15=highway_footway,  21=sport_soccer,  22=leisure_ 
pitch,  128=landuse_forest,  156=sport_basketball 

Table  2:  A sample  of  matched  rules  for  the  entity  with  osm_id=96279661. 
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grass,  forest  and  garden  classes  in  descending  confidences  of  92%,  46%,  32%,  13% 
and  12%,  respectively.  Although  the  entity  is  currently  classified  as  meadow , its 
characteristics  make  it  more  appropriate  to  be  classified  as  a park.  Hence,  our 
recommendation  works  to  guide  contributors  towards  the  most  appropriate 
classification. 


Evaluation  Process 

To  evaluate  the  classifier,  we  do  not  have  a ground-truth  dataset  for  these  enti- 
ties. The  available  ground-truth  datasets  cover  a higher  classification  level  of 
land  use  or  land  cover.  Thus,  we  depended  on  manual  visual  investigation  to 
evaluate  the  results.  Figure  5 presents  examples  of  appropriate  and  inappropri- 
ate classifications,  based  on  the  developed  classifier  and  recommendations. 

Figure  5(a)  gives  examples  of  appropriate  classifications.  From  left  to  right, 
the  first  entity  is  adjacent  to  residential  houses  and  other  gardens  and  does  not 
contain  much  infrastructure.  The  entity  is  appropriately  classified  as  garden. 
The  second  one,  located  between  highways  and  containing  nothing,  is  most 
likely  to  be  classified  as  grass.  The  last  entity  contains  a water  body,  sports  cent- 
ers, footways  and  other  infrastructure.  It  is  correctly  classified  as  a park. 


(a)  Appropriate  Classification  of  garden , grass  and  park  classes 


(b)  Inappropriate  Classification  of  garden,  grass  wad  park  classes 
Figure  5:  Example  of  appropriate  and  inappropriate  classifications. 
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In  Figure  5(b),  the  classifier  detects  these  entities  as  problematically  classi- 
fied entities.  From  left  to  right,  the  first  entity  is  classified  as  garden.  The  entity 
meets  meadow  and  is  located  near  to  a farmland.  It  does  not  inherit  any  plant  or 
decoration  characteristics.  The  classifier  recommends  meadow  as  an  appropri- 
ate class.  Whereas  the  middle  entity  is  classified  as  grass , despite  the  fact  that 
it  seems  too  large,  contains  sports  centers,  is  surrounded  by  forest  areas  and  is 
adjacent  to  a playground.  The  classifier  recommends  park  class  for  this  entity. 
The  entity  on  the  right  shows  a clear  example  of  inappropriate  classification 
of  park.  The  entity  is  located  between  roundabouts  and  does  not  contain  any 
infrastructure  at  all.  The  classifier  recommends  it  to  be  classified  as  grass. 


Grass&Green:  a quality  assurance  web  tool 

As  another  way  to  evaluate  the  proposed  approach,  we  developed  a web  tool 
as  a recommendation  system  called  Grass&Green8  as  indicated  in  Figure  6. 
The  tool  presents  the  generated  recommendations  for  crowd  revisions  as  pro- 
posed in  the  checking  scenario  (see  section  3.2).  We  created  social  media  pages 
to  attract  the  contributors  for  revisions:  Facebook  and  Twitter.  Moreover,  we 
wrote  OSM  diaries  to  announce  the  tool  to  the  OSM  community.  In  this  tool, 
the  user  logs  in  via  his/her  OSM  account  and  contributes  directly  to  the  project. 
The  tool  presents  entity  by  entity,  combined  with  the  recommended  classes  and 


Figure  6:  Grass&Green:  the  main  contribution  interface  (last  visit  at  September 
2015). 


http:// opensciencemap.org/  quality 
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other  potential  classes.  Due  to  the  ambiguity  of  grass-related  features,  we  pro- 
vide the  users  with  the  two  most  recommended  classes.  The  user  has  the  ability 
to  press  an  “I  don’t  know”  (inactive  participation)  in  unclear  cases.  Moreover, 
the  user  could  also  change  the  recommendations  or  choose  between  (yes,  no  or 
maybe)  in  case  of  high  ambiguity.  The  tool  provides  users  with  detailed  textual 
and  visual  descriptions  about  the  target  classes  and  other  grass-related  classes. 

Eleven  days  after  launching  the  tool,  we  obtained  promising  results.  We  had 
around  80  users  from  various  countries.  They  checked  560  entities:  485  active 
and  75  inactive  participations.  They  agree  with  the  generated  recommenda- 
tions as  follows:  30.10%  full  agree,  60.84%  partial  agree  and  9.05%  disagree. 
The  findings  indicate  the  feasibility  of  the  approach  and  the  tool  acts  perfectly 
to  improve  the  classification  of  OSM  data. 


Discussion  and  Conclusions 

Conceptualization  of  geographic  features  has  long  been  a topic  of  debate  (Frank 
1997).  However,  with  the  increasing  role  of  public  participants  in  collecting 
geospatial  data,  it  becomes  a crucial  issue.  GIS  applications  exploit  VGI  as  an 
auxiliary  data  source.  That  means  the  data  developed  by  public  participants 
is  used  to  provide  services  for  others.  A fact  that  raises  more  attention  to  the 
resulting  data  quality. 

In  particular,  how  do  the  participants  perceive  the  space?  How  do  they  group 
and  categorize  the  geographic  features?  How  do  they  find  the  commonalities 
and  differences  between  conceptually  overlapping  classes?  All  these  questions 
might  be  addressed  by  utilizing  the  developed  geospatial  ontologies.  Frank 
(1997)  discussed  the  vital  role  of  ontology  in  GIS  applications,  to  achieve  a bet- 
ter understanding  of  the  space  and  to  build  more  efficient  information  systems. 
For  example,  the  OWL2  ontology  that  has  been  developed  for  structuring  city 
information  modeling  with  respect  to  land  use  mapping  (Montenegro  et  al. 
2012).  OSMonto  is  another  ontology,  which  has  been  developed  to  enrich  the 
semantics  of  OSM  tags,  without  correcting  or  modifying  any  conceptual  mis- 
takes in  the  taxonomy  of  OSM  tags  (Codescu  et  al.  2011).  However,  the  link 
between  ontologies’  producers  and  consumers,  in  the  GIS  domain,  still  needs 
more  research. 

In  VGI,  the  data  is  classified  following  the  bottom-up  approach;  where  the 
participants  contribute  data  based  on  their  local  knowledge.  They  translate 
their  observations  into  classes  and  categories.  While  in  professional  methods, 
the  data  is  classified  based  on  a top-down  approach;  where  a pre-defined  model 
is  developed  based  on  strict  measures  defining  the  classes.  The  difference  of 
VGI  approach  leads  to  questionable  data  classification.  Therefore,  guiding 
amateur  participants  is  needed  for  enhanced  data  classification.  For  example, 
designing  intelligent  data  capturing  interfaces  is  one  possibility,  among  oth- 
ers, to  support  the  contribution  of  enhanced  data  classification.  This  chapter 
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calls  for  the  development  of  intuitive  interfaces  for  VGI  projects;  negotiation, 
exemplifications  and  comparisons  are  human-centered  approaches  that  could 
be  used  to  support  the  VGI  participants  during  the  contribution  process. 

In  this  chapter,  we  addressed  the  VGI  quality  from  a classification  perspec- 
tive. We  investigated  the  classification  correctness  of  an  entity  with  respect  to  its 
inherent  geographic  characteristics.  We  proposed  an  approach  for  guided  clas- 
sification. The  approach  tackles  the  classification  challenges  in  the  OSM  project 
by  guiding  the  contributors  during  the  classification  process.  The  approach  has 
two  phases:  the  Learning  and  Guiding  phases.  During  the  Learning  phase,  the 
approach  utilizes  the  OSM  dataset  to  learn  the  distinct  topological  character- 
istics that  distinguish  between  similar  classes.  Data  mining  algorithms  have 
been  used  to  develop  a classifier.  Afterwards,  the  developed  classifier  is  used  in 
different  scenarios  (contributing,  checking  and  enriching)  during  the  Guiding 
phase.  The  approach  aims  not  only  to  improve  the  classification  of  data,  but  it 
could  be  used  to  enrich  the  data  source  as  well. 

We  conducted  visual  investigations  and  an  implementation  to  evaluate  the 
proposed  approach.  We  developed  a classifier  to  distinguish  among  a set  of 
grass-related  classes.  The  selected  classes  have  some  similarity,  but  each  one  has 
its  unique  characteristics.  The  findings  emphasized  the  feasibility  of  the  pro- 
posed approach.  The  developed  tool  shows  the  positive  response  of  the  crowds 
towards  the  data  quality.  The  presented  results  are  preliminary  indicators  of 
an  enhanced  data  classification.  We  will  keep  investigating  the  tool  results 
and  check  the  enhanced  data  classification  in  more  details.  In  future  work,  the 
research  would  investigate  how  to  generalize  the  developed  classifier.  In  addi- 
tion, the  intuitive  user  interface  would  be  studied  to  develop  human- centered 
guided  classification  that  corresponds  to  the  nature  of  VGI. 
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Abstract 

This  chapter  presents  contributions  to  managing  the  quality  of  Volunteered 
Geographical  Information  (VGI)  and  of  crowd  sourced  geographical  informa- 
tion (CSGI)  brought  by  the  representation  of  specific  knowledge  items:  task 
and  context.  Task  and  context  modelling  have  been  studied  in  different  com- 
munities. We  propose  an  approach  for  integrating  their  results  with  the  per- 
spective of  improving  the  quality  management  of  VGI  and  CSGI. 

Keywords 
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Introduction 

The  ENERGIC  COST  ACTION  targets  the  usage  of  Volunteered  Geographical 
Information  (VGI)  and  of  crowd  sourced  geographical  information  (CSGI)  in 
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scientific  applications.  One  challenge  addressed  in  this  action  is  data  quality 
management. 

Data  quality  management  has  been  addressed  for  several  years  with  respect 
to  classical  geographical  data,  by  operational  bodies  like  the  National  Mapping 
Agencies  and  by  scientists  mainly  from  the  geographical  information  science 
community.  The  issues  they  tackle  are  also  relevant  in  the  context  of  volun- 
teered and  user  generated  data,  for  example  managing  the  lack  of  a univer- 
sal data  model  about  geographical  space  and  the  unavoidable  heterogeneities 
between  geographical  data.  Section  1 lists  findings  about  geographical  data 
quality  management  from  the  literature  and  from  the  experience  of  the  French 
National  Mapping  Agency,  IGN.  These  findings  relate  to  the  management  of 
data  specifications,  the  definition  and  documentation  of  quality  criteria,  the 
assessment  of  which  inherent  characteristics  of  the  data  will  impact  the  output 
of  a given  application  result  and  the  communication  of  quality  to  the  user. 

Section  2 specifically  studies  the  potential  of  context  and  tasks  modelling  to 
implement  these  findings  in  the  context  or  VGI  and  CSGI.  Context  can  account 
for  much  heterogeneity  in  VGI  and  CSGI.  Tasks  are  useful  pieces  of  knowledge 
to  plan  the  usage  of  relevant  resources  to  achieve  an  objective.  Context  and 
tasks  modelling  are  studied  by  communities  tackling  information  management 
and  exchange  between  implemented  components  and  humans,  like  distributed 
architectures,  interoperability,  ubiquitous  mapping,  location  based  services  and 
human-machine  dialogue  interfaces. 

An  approach  to  integrate  the  context  and  tasks  models  to  address  part  of  the 
research  questions  expressed  in  the  beginning  of  this  chapter  is  discussed  at  the 
end  of  this  chapter. 

Geographical  data  quality  management 
External  quality 

Quality  is  defined  in  ISO  9000  as  the  degree  to  which  a set  of  inherent  character- 
istics fulfils  some  requirements.  This  definition  of  quality  is  relative  to  an  appli- 
cation. For  example,  important  inherent  characteristics  of  3D  data  for  visualiza- 
tion applications  refers  to  accurate  and  realistic  textures  as  well  as  consistency 
of  visible  shape  elements  and  very  low  level  of  detail  for  elements  non-visible  in 
the  current  scene.  Important  characteristics  of  the  same  data  for  firemen  access 
application  requirements  refers  to  the  exhaustiveness  and  geometric  accuracy  of 
specific  features  like  windows,  electricity  cables  and  tramway  cables.  This  quality 
is  referred  to  as  ‘fitness  for  use  or  ‘external  quality  (Devillers  & Jeansoulin  2006). 

Most  applications  considered  in  ENERGIC  action  share  some  common  data 
requirements: 

• the  ability  to  discover  and  reuse  the  data, 

• the  ability  to  combine  the  data  with  other  data, 
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• the  ability  to  ground  a result  on  data,  i.e.  to  control  the  gap  between  data 
and  interpretation. 

The  two  first  criteria  are  recurrent  and  are  promoted  by  initiatives  like  the 
INSPIRE  directive  or  by  the  W3C  vision  for  Linked  Open  Data  where  data 
should  be  produced  once  and  made  available  (Bizer,  Heath  & Berners  Lee  2009). 
The  third  criterion  relates  to  having  a documentation  of  uncertainty  related  to 
the  data  and  how  it  propagates  to  the  result.  Besides,  when  it  comes  to  volun- 
teered content,  it  is  also  useful  to  consider  requirements  expressed  by  the  Web 
community.  W3C  propose  a ranking  scheme  for  open  data  that  list  important 
quality  criteria  from  their  perspective:  publication  on  the  web  (one  star),  in  a 
machine  readable  format  (two  stars),  in  a non-proprietary  format  (three  stars), 
compliant  to  RDF  standards  - using  dereferenceable  URIs  to  name  things  (four 
stars)  and  publish  links  to  other  URIs  (five  stars). 

With  respect  to  the  above  requirements,  external  quality  of  data  will  very 
much  depend  on  metadata  and  documentation. 

Besides,  user  requirements  will  eventually  be  met  by  an  application  involv- 
ing software  and  data.  Hence,  geographical  data  quality  assessment  is  closely 
related  to  geographical  software  quality  assessment. 

Internal  quality 

A specific  intermediate  quality  concept  is  needed  to  document  inherent  char- 
acteristics of  geographical  data  that  will  be  useful  for  every  user  to  evaluate 
their  ability  to  fulfil  their  application  requirements.  This  is  the  ‘internal  qual- 
ity’ (Devillers  & Jeansoulin  2006).  The  data  producer  should  distribute  its  data 
together  with  the  description  of  this  internal  quality  and  the  users  can  at  the 
end  use  this  description  (and  the  data)  to  assess  external  quality  of  the  data 
for  their  application.  Indeed,  many  geographical  data  (base  maps  for  instance) 
have  seldom  been  acquired  for  one  specific  application  but  rather  to  be  reused 
in  several  applications,  and  possibly  by  users  who  sometimes  are  far  from  the 
production  of  the  data  and  who  did  not  express  their  quality  requirements, 
hence  did  not  express  which  inherent  characteristics  are  important  for  them. 

Internal  quality  has  been  traditionally  documented  by  national  mapping 
agencies,  and  in  current  ISO/OGC  metadata  standards  (ISO  TC211  2014) 
based  on  three  elements: 

• the  targeted  description,  called  the  data  specifications, 

• quality  criteria  describing  some  distance  between  the  produced  data  and  an 
imaginary  flawless  data  sets  compliant  with  these  specifications, 

• the  lineage  metadata. 

In  other  words,  data  producers  have  considered  that  the  characteristics  that 
will  help  future  users  assess  the  external’  quality  of  their  data  are  globally  the 
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assessment  of  how  far  the  geographical  data  are  from  the  reality  they  aim  at 
describing.  These  three  items  are  described  more  precisely  hereafter. 

The  specifications  are  the  scope  of  the  observation  process  that  will  lead  to 
the  data:  when,  where,  what  objects,  what  level  of  detail.  It  is  crucial  to  explic- 
itly describe  specifications  because  there  is  no  natural  such  abstraction  even 
though  one  may  intuitively  think  so.  Our  world  is  heterogeneous  in  multiple 
ways  whereas  an  abstraction  will  provide  only  one  specific  set  of  categories 
and  classification  schemes.  The  most  satisfying  solution  so  far  is  to  provide 
abstractions  that  are  relevant  for  a certain  spatial  and  temporal  scope  and  also 
for  a given  point  of  view  on  reality.  Not  only  are  specifications  an  important 
item  for  describing  quality,  but  also  they  improve  the  homogeneity  of  the 
data  during  data  production.  It  has  always  been  a necessity  and  a challenge 
to  share  specifications  among  operators  involved  in  the  acquisition  process 
for  national  mapping  agencies  producing  topographical  data  over  a national 
territory  (Sheeren,  Mustiere  & Zucker  2004).  Indeed,  geographical  database 
specifications  should  refer  to  a common  ontology  of  reality  which  does  not 
exist  (Abadie  2009). 

Quality  criteria  are  measures  of  distance  between  produced  data  and  what  is 
called  ‘terrain  nominal’,  i.e.  data  that  would  have  been  produced  strictly  consid- 
ering the  specifications  (and  in  real  time).  When  a quality  criteria  is  attached 
to  a product  (and  not  to  a specific  data  set),  it  means  a commitment  of  the 
producer  to  respect  a certain  thresholds  during  data  production.  Quality  crite- 
ria describing  ‘uncertainties’  and  errors’  possibly  introduced  during  the  actual 
production  have  been  standardized  after  four  fundamental  dimensions:  posi- 
tional accuracy,  attribute  accuracy,  logical  consistency,  completeness  (Good- 
child  & Li  2012).  Usually  a product  description  includes  some  commitments  of 
the  producer  about  these  criteria  threshold.  For  example  such  metadata  for  a 
road  data  product  can  be:  the  product  should  describe  every  road  longer  than 
50  m,  thanks  to  a series  of  points  acquired  at  the  axis  of  the  road,  with  10  m 
precision,  with  time  accuracy  of  6 months  and  exhaustiveness  of  98%  on  the 
national  territory.  Whereas  these  examples  refer  to  explicit  attributes  and  enti- 
ties, it  is  also  important  in  geographical  data  to  consider  some  implicit  spatial 
properties  and  relationships.  An  important  paradigm  of  geographical  data  is 
that  many  relations  and  properties  are  not  explicit  in  the  data  but  can  be  com- 
puted based  on  the  coordinates.  Several  authors  study  the  evaluation  of  some 
spatial  properties  and  relationships,  usually  referred  to  as  spatial  consistency 
rules  (Servigne  et  al.  2000).  A recurrent  quality  criteria  referring  to  consistency 
is  the  topology. 

Last,  the  lineage  metadata  refers  to  sources  data  and  processes  that  led  to 
the  data.  It  is  somehow  comparable  to  the  ‘source  code’  of  a software  that 
will  be  useful  to  debug  i.e.  if  something  unexpected  happens  in  the  applica- 
tion to  investigate  if  it  can  be  explained  by  the  geographical  data  production 
process. 
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Quality  challenges  intensified  by  VGI  and  CSGI  context 

Whereas  by  essence,  VGI  and  CSGI  production  process  can  be  seen  as  an 
opportunity  to  improve  some  dimensions  of  internal  quality  like  the  update 
frequency,  this  production  process  also  makes  it  more  difficult  to  handle  cer- 
tain dimensions  of  internal  quality 

The  main  flaw  in  our  opinion  is  the  weak  definition  and  documentation  of 
expected  specifications  of  the  produced  data.  Besides,  Linus’s  law  given  enough 
eyes,  all  bugs  are  shallow’,  referring  to  the  ability  of  the  crowd  to  converge  on 
the  truth,  does  not  always  work  for  VGI.  Goodchild  and  Li  (2012)  and  Haklay 
et  al.  (2010)  have  shown  that  users  do  not  always  agree  on  a value.  This  is  also 
a motivation  for  explicitly  stating  quality  specifications.  Brando  and  Bucher 
(2010)  focus  on  the  definition  of  such  specifications  for  user  generated  geo- 
graphical content  prior  to  the  production  of  the  data.  They  proposed  a method 
to  instantiate  specifications  based  on  OSM  tags,  Wikipedia  infoboxes  and  the 
NMA  product  specifications  (Brando,  Bucher  & Abadie  2011).  Yet,  their  work 
does  not  address  the  issue  of  acceptability  of  these  specifications  by  contribu- 
tors and  of  evolution  of  such  specifications. 

One  aspect  of  the  proposal  concerns  the  establishment  of  explicit  consistency 
rules  between  the  user  generated  content  and  reference  data  provided  by  the 
French  national  mapping  agency,  a public  funded  professional  organizations 
who  commit  to  reach  specific  level  of  quality  criteria  for  some  image  data  and 
topographic  themes.  Acquiring  external  rules  that  can  be  used  to  evaluate  con- 
sistency of  geographical  content  now  is  more  generally  an  important  domain  of 
research  (Goodchild  & Li  2012)  and  has  led  to  the  creation  of  the  organization 
OSMGB  in  UK  which  aims  at  listing  such  rules  and  setting  up  a formal  quality 
insurance  model  to  improve  the  trust  of  local  administration  in  collaborative 
geodata. 

Another  flaw  is  the  lack  of  explicit  commitment  to  follow  these  specifica- 
tions and  reach  quality  criteria  thresholds,  e.g.  of  any  update  frequency,  and 
the  lack  of  assessment  of  quality  criteria  to  document  the  gap  between  acquired 
data  and  the  specifications.  So  far,  documentation  of  quality  criteria  is  done 
in  punctual  studies,  like  for  research  about  the  quality  of  VGI  data  comparing 
OSM  data  with  data  whose  quality  already  is  documented,  like  Haklay  (2010) 
in  the  UK  and  Girres  and  Touya  (2010)  in  France.  Goodchild  and  Li  (2012) 
and  Haklay  et  al.  (2010)  also  showed  evidence  that  there  are  not  always  enough 
people  interested  in  a particular  area  or  feature. 

To  conclude  this  first  section  of  the  chapter,  managing  quality  of  VGI  and 
CSGI  can  benefit  from  knowledge  gained  about  the  management  of  quality  of 
geographical  data.  The  quality  of  a data  set  is  documented  either  according  to 
a dedicated  application  or  in  a more  generic  way  as  a distance  between  a flaw- 
less ideal  representation  of  a geographical  space  conforming  to  a given  abstract 
model  and  a data  set  produced  by  remote  sensing,  in  situ  sensing  and  symbolic 
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knowledge  production.  It  is  highly  recommended  to  define  explicitly  a targeted 
abstraction  of  a geographical  space  and  it  is  easier  to  try  and  provide  some  that 
is  ‘locally’  relevant.  A relevant  abstract  model  should  not  only  be  composed 
of  classification  schemes  but  also  of  consistency  rules.  Documenting  the  dis- 
tance between  a targeted  abstraction  and  a dataset  cannot  be  done  exactly  but  is 
approximated  by:  quality  criteria  (exhaustiveness  of  a feature  or  of  an  attribute, 
and  so  on),  lineage  information  (also  known  as  provenance  metadata). 

When  it  comes  to  VGI  and  CSGI  specific  stakes  are: 

• the  actual  description  and  maintenance  of  data  specifications, 

• the  shareability  of  specifications  among  the  contributors,  among  users  and 
the  possibility  to  compare  and  align  the  model  it  with  other  data  models, 

• producers’  commitment  to  quality  criteria. 


Contributions  brought  from  context  and  tasks  modelling 

This  section  lists  some  contributions  to  address  the  objectives  of  quality  man- 
agement listed  just  before. 


Context  modelling 

Firstly,  since  there  is  no  such  thing  as  a universal  widely  shared  abstract  model 
of  reality,  we  advocate  it  is  better  to  keep  the  data  as  close  as  possible  to  their 
production  process  (typically  to  keep  sensor  data)  with  context  information 
that  explain  the  data  (see  Chapter  Enquiring  VGI)  then  trying  to  merge  every 
contribution  into  a pivot  model.  In  this  perspective,  context  modelling  is  an 
important  metadata  to  account  for  much  heterogeneity  in  VGI  and  CSGI. 

Some  context  elements  are  already  studied  in  the  literature  about  VGI  to  infer 
quality  and  trust  metadata  like  the  contributor  profile,  his  status  within  the 
VGI  system  (normal/advanced  user  in  Wikimapia,  normal/sysop  in  Wikipe- 
dia, ordinary/Data- working-group  in  OSM),  his  motivation  and  level  of  quality 
requirements  with  respect  to  data  (Coleman  et  al.  2009),  the  places  they  live  in 
(Goodchild  2009)  (Bishr  & Kun  2007),  their  relationships  with  other  contribu- 
tors (Bishr  & Kuhn  2007). 

Other  relevant  elements  are  studied  in  the  domain  of  location  based  services 
and  ubiquitous  mapping  where  context  is  an  important  element  to  understand 
how  someone  may  mentally  interact  with  an  abstract  representation  - usually 
accessible  through  a visual  representation-  of  his  surrounding,  which  are  the 
time  of  the  day,  the  season,  the  user  age,  nationality,  gender  (Jakobsson  2002) 
and  culture  (Edsall  2007). 

Another  very  important  context  element  is  the  contributor  intention.  In  col- 
laborative content  edition,  it  is  described  through  the  effects  of  the  contribution 
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on  the  content  at  the  moment  when  the  contribution  was  defined  (Sun  et  al. 
1998),  for  example:  refining  a shape,  fixing  an  alignment  between  two  features, 
adding  a missing  building.  Describing  the  intended  effect  on  the  representation 
requires  some  abstract  model  of  reality  that  must  be  as  close  as  possible  to  the 
model  the  contributor  had  in  mind.  This  refers  to  the  possibility  for  the  con- 
tributor to  annotate  contributions  with  a shareable  abstract  model.  To  enhance 
the  interoperability  of  abstract  models,  it  is  now  encouraged  to  publish  them  as 
Vocabularies  on  the  web  of  data,  i.e.  as  RDF  schemas  available  online  thanks 
to  dereferenceable  URIs.  RDF  vocabularies  to  distribute  and  share  geodata 
are  studied  in  the  geographical  information  domain  and  in  the  semantic  web 
community  (Goodwin,  Dolbear  & Hart  2008;  Vilches-Blazquez  et  al.  2010; 
Atemezing  et  al.  2014). 


Task  modelling 

Tasks  models  organize  knowledge  about  the  usage  of  relevant  resources  to 
achieve  an  objective.  In  the  context  of  VGI  and  CSGI  quality  management,  this 
is  useful  with  respect  to  modelling  three  kinds  of  tasks: 

• the  usage  of  space  by  a citizen  when  he  is  producing  data,  for  instance  going 
to  work  - and  producing  a GPS  track, 

• the  collaboration  or  cooperation  between  citizens  to  produce  data,  for 
instance  the  organization  of  edition  during  a mapping  party, 

• the  user  task  that  requires  geographical  data,  for  instance  evaluating  the 
impact  of  a new  road  on  the  local  biodiversity. 

The  first  kind  of  task  is  an  element  of  context  that  is  useful  to  elicit  the  abstract 
model  people  have  in  mind  when  they  produce  data  -the  last  context  element 
mentioned  in  section  above-.  As  demonstrated  by  (Gibson  1979),  people  see 
the  landscape  through  his  functional  relevance  to  their  goals.  In  other  words,  if 
a contributor  rides  a bike  he  will  see  the  street  from  a different  perspective  than 
if  a contributor  is  in  a wheeling  chair. 

The  second  kind  of  task  has  been  studied  by  (Das  et  al.  2014)  who  experi- 
mented with  a task  assignment  model  to  organize  the  production  of  one  con- 
tent among  several  contributors  to  optimize  exhaustiveness,  cost  and  precision. 
The  production  is  modelled  as  a task  decomposed  into  subtasks  that  can  be 
assigned  to  people.  The  system  requires  user  profiles  to  make  the  assignment 
based  on  user  expertness  and  availability,  and  define  the  reward  they  need. 
There  exists  relevant  work  in  the  literature  to  guide  strategies  for  collabora- 
tive geographical  data  production.  Wilkinson  and  Huberman  (2007)  study  the 
nature  of  the  collaboration  that  will  impact  the  quality  of  the  produced  content. 
Maue  and  Schade  (2008)  propose  a solution  where  contributors  ask  themselves 
for  reviewers  when  they  lack  confidence  in  their  own  contributions. 
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In  the  domain  of  model  collaborative  edition,  some  authors  have  proposed 
a model  were  user  contributions  are  directly  expressed  as  operations  and  not 
as  a new  content,  in  order  to  be  as  close  as  possible  to  contributor  intention. 
In  Brando,  Bucher  and  Abadie  (2011),  user  edition  can  be  expressed  as  the 
enforcement  of  relationships  (i.e.  implicit  information)  instead  of  geometries 
because  the  authors  thought  users  may  be  more  expert  in  assessing  relation- 
ships between  objects  than  geometries. 

Rehrl  et  al.  (2013)  proposed  a task/operation  based  model  to  analyse  user 
contribution  to  a collaborative  geographical  content. 

Last,  task-based  application  design  can  be  useful  to  express  external  quality 
criteria.  The  application  is  modelled  as  a task  which  has  pre  and  post  conditions, 
input  and  output  data  (Sun  & al.  2012).  A task  also  has  a method  to  decompose 
high  level  tasks  into  elementary  tasks,  noting  that  these  can  be  either  machine 
tasks  (computation)  or  user  tasks  (e.g.  finding  a geographic  feature  on  a map). 
As  an  example  let  us  consider  the  task  ‘find  a restaurant’.  This  task  is  associated 
to  subtasks  such  as  (1)  consult  the  list  of  all  restaurants  in  a given  area,  which 
requires  the  completeness  in  the  area,  with  an  accuracy  of  10  m,  (2)  ‘find  route 
to  address’,  which  requires  a traffic  network  representation  that  is  topologically 
correct  and  complete.  The  evaluation  of  fitness  for  use  can  benefit  from  the 
development  of  typologies  and  ontologies  of  tasks  performed  on  spatial  data. 
Several  researchers  have  already  worked  in  this  direction.  For  instance,  von 
Hunolstein  and  Zipf  (2003)  define  a task  typology  in  map-based  mobile  guides: 
high-level  tasks  have  been  associated  to  subtasks  and  a mapping  between  goals 
and  tasks  has  also  been  defined.  For  example  the  task  ‘Navigation  is  associated 
to  subtasks  such  as  ‘routing  from  point  A to  B’  and  to  goals/purposes  ‘navi- 
gation, exploring,  planning,  education.  Park,  Yoon  and  Kwon  (2012)  present 
a task  ontology  for  intelligent  tourist  information  service,  based  on  travelers’ 
needs  and  activities.  Lemmens  (2006)  proposed  an  ontology  to  support  the 
chaining  of  operations  in  geographical  information  architectures.  Bucher  and 
Jolivet  (2008)  demonstrated  the  difficulty  to  document  pre  and  post-conditions 
of  an  elementary  task  (Bucher  & Jolivet  2008).  Beyond  defining  a vocabulary  to 
express  pre  and  post  conditions,  a major  bottleneck  is  the  acquisition  of  their 
value  because  it  requires  setting  up  benchmarks  simulating  all  possible  specific 
cases  of  geometrical  configurations. 


Discussion  and  conclusion 

Quality  management  traditionally  requires  the  documentation  of  specifications, 
the  control  of  quality  criteria  value,  and  the  description  of  lineage  metadata. 

An  important  challenge  raised  by  VGI  CSGI  quality  management  is  ambi- 
guities, inconsistencies  and  heterogeneities  due  to  different  abstractions  of 
the  geographical  space  involved  in  production.  These  are  not  limited  to  fea- 
tures classifications;  they  should  also  include  important  relationships  between 
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elements  used  in  consistency  management,  affordances  of  features  in  contribu- 
tor activities  and  rules  to  encode  the  perceived  reality  in  data.  Another  chal- 
lenge is  to  manage  the  quality  of  data  products,  hence  to  somehow  commit  to 
some  thresholds  for  the  quality  criteria. 

In  section  2 we  advocated  that  it  is  very  relevant  to  tackle  these  issues  from 
the  perspective  of  knowledge  engineering.  The  derivation  of  usable  informa- 
tion from  raw,  heterogeneous  and  distributed  acquisitions  would  greatly  ben- 
efit from  enhanced  model  of  the  context  in  which  a contribution  is  produced. 
The  modelling  of  information  derivation  from  raw  acquisition  can  be  seen  as  a 
flexible  process  where  the  integration  is  done  when  it  is  needed  and  where  the 
sources  are  preserved  as  much  as  possible  in  order  not  to  lose  any  meaningful 
information.  The  notion  of  context  comprehends  many  elements  which  have 
already  been  studied  in  various  domains  like  VGI  quality  assessment,  ubiqui- 
tous mapping  and  ecology.  Task  models  can  also  contribute  to  this  knowledge 
engineering  project  in  several  ways:  to  clarify  how  users  perceive  the  space  they 
will  describe,  to  get  external  quality  criteria,  and  to  improve  the  coordination 
of  citizen  and  their  interactions  towards  the  production  of  a common  content. 

There  is  still  work  to  be  done  to  integrate  the  different  findings  in  context 
modelling  and  in  tasks  modelling.  An  interesting  perspective  is  to  improve  the 
description  of  user  intention  when  they  contribute.  Rehrl  et  al.  (2013)  paves  the 
way  for  a relevant  approach  of  the  problem.  Their  low-level  tasks  categoriza- 
tion, such  as  create/update  a geographic  feature  or  a relation,  could  be  extended 
to  conceptualize  higher  level  intentions,  such  as  for  instance  to  reflect  a change 
of  navigation  restriction  that  occurred  in  the  reality,  to  propose  a more  detailed 
description  of  the  cross-road  geometry  and  topology,  to  update  an  attribute 
value  to  reflect  a change  in  the  specifications,  to  fix  an  inconsistent  misalign- 
ment of  buildings  in  the  data.  Other  typical  VGI  tasks  need  modelling  such  as 
selecting,  evaluating,  integrating  existing  data,  assigning  sensor  task  to  contrib- 
utors, evaluating  user  capacities  with  respect  to  quality  criteria.  The  examina- 
tion of  data  quality  issues  and  the  literature  shows,  in  our  mind,  an  opportunity 
to  define  an  ontology  of  ‘human  sensing  tasks  that  would  describes  capacities 
to  produce  pieces  of  data  by  a given  human  agent  or  several  human  agents 
together,  with  explicit  objectives  assigned  and  in  a given  observation  context. 
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Abstract 

This  chapter  provides  a brief  overview  of  some  methodologies  used  to  extract 
meaning  from  the  analysis  of  geotagged  images.  Broadly  they  draw  from 
research  in  natural  language  processing  and  statistical  and  exploratory  tech- 
niques. The  confidence  we  attach  to  outputs  from  such  analysis  depends  upon 
the  questions  we  ask,  our  ability  to  take  account  of  both  the  behaviour  and 
motivation  of  the  users  contributing  to  user  generated  content,  and  the  close 
relationship  between  how  the  data  are  spatially  aggregated  and  the  meanings 
associated  with  descriptions  of  images. 
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Introduction 

We  continue  to  witness  phenomenal  growth  in  the  production  of  user  gener- 
ated content  (UGC).  Some  of  that  content  comes  in  the  form  of  photographs. 
Many  are  either  annotated  or  tagged  in  a manner  that  may  reveal  aspects  of 
users’  conceptual  understanding  of  place.  In  this  article  we  concern  ourselves 
with  methods  to  extract  meaning  from  large  collections  of  textually  annotated 
georeferenced  photographs.  Such  collections  have  been  the  subject  of  consider- 
able attention  over  the  last  decade,  for  a number  of  reasons.  Firstly,  and  perhaps 
most  importantly,  the  data  are  accessible.  For  instance,  both  Flickr1  and  Pano- 
ramio2  provide  application  programming  interfaces  which  make  it  possible  for 
researchers  to  scrape  images  and  associated  metadata,  while  Geograph3  con- 
tent is  provided  under  a Creative  Commons  Attribution- Share  Alike  licence. 
Secondly,  unlike  other  social  media,  the  link  between  position,  annotation  and 
content  is  often  relatively  direct  and  closely  linked  to  peoples  sense  of  place. 
People  take  pictures  of  things  and  events  that  happen  somewhere,  at  some  time, 
and  describe  them  accordingly. 

A wide  range  of  applications  have  been  developed  that  variously  utilise  this 
data  in  order  to  extract  information  and  meaning: 

• Automatic  generation  of  gazetteer  data  (Kessler  et  al.  2009) 

• Extraction  and  delineation  of  vernacular  place  names  (Hollenstein  & 
Purves  2010) 

• Tag  recommendations  for  images  based  on  location  (Rattenbury  & Naaman 
2009) 

• Adding  information  to  existing  spatial  databases  (Antoniou  et  al.  2010) 

• Extraction  of  place  semantics  at  a range  of  scales  (Feick  & Robertson  2014; 
Purves  et  al.  2011;  Rattenbury  & Naaman  2009) 

• Summarising  and  aggregating  properties  of  the  semantics  of  space  (Ahern 
et  al.  2007;  Dykes  & Wood  2009;  Purves  et  al.  2011) 

• Exploring  movement  of  groups  of  individuals  in  space  (Girardin  et  al.  2008) 

• Identification  and  prediction  of  locations  in  text  (O’Hare  & Murdock  2012) 

• Extraction  of  events  using  space-time  clustering  (Andrienko  et  al.  2010) 

All  of  these  approaches  require  methods  which  go  beyond  analysing  spatial 
patterns  associated  only  with  the  locations  of  photographs.  This  is  the  province 
of  an  established  toolbox  of  geo  statistical  techniques  for  point  pattern  analy- 
sis able  to  describe  spatial  distributions  and  multi-scale  patterns  (O’Sullivan  & 
Unwin  2003:  chap.  4).  Additionally  we  may  wish  to  infer  place  semantics  from 
other  metadata  associated  with  images  (e.g.  user,  annotation,  and  time  as  well 


1 www.flickr.com 

2 www.panoramio.com 

3 http://www.geograph.org.uk/ 
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as  location).  In  this  article  we  are  only  concerned  with  annotations  written 
by  the  user:  the  person  who  uploaded  the  photograph,  and  typically,  but  not 
always,  took  it.  This  user  information  allows  association  of  a set  of  photographs 
with  an,  pseudo-anonymous,  individual  and  annotations  can  take  the  form  of 
a title,  a narrative  (often  descriptive  text),  and  a set  of  tags.  Tags  are  lists  of 
key  words  selected  freely  by  a user  (Rattenbury  & Naaman  2009)  and,  like  all 
annotation  associated  with  photographs  in  user  generated  content,  may  have  a 
number  of  different  motivating  factors,  including  organisation  of  content  for 
personal  reasons,  providing  informative  descriptions  and  making  photographs 
findable  by  others.  Locations  may  reflect  the  scene  photographed,  but  with  the 
advent  of  smart  phones  capable  of  automatically  annotating  images  with  GPS 
coordinates,  more  commonly  reflect  the  photographer  s position.  Finally,  tem- 
poral information  often  reflects  both  time  of  upload  to  the  database  and  the 
time  at  which  a photograph  was  recorded  by  a camera  as  having  been  taken. 

This  set  of  properties  allows  us  to  formulate  a set  of  basic  questions  which  can 
be  asked  of  a collection  of  annotated,  georeferenced  photographs: 

1)  What  language  is  used  to  describe  photographs? 

2)  How  can  structured  knowledge  be  extracted  from  annotations? 

3)  What  influence  do  users  have  on  information  extracted  from  annotated 
georeferenced  photographs? 

4)  How  can  we  capture  the  relationship  between  language  and  location? 

5)  How  do  descriptions  extracted  from  annotations  vary  according  to  scale 
and  region  definitions? 

In  the  following,  we  introduce  a methodological  toolbox,  drawn  from  a repre- 
sentative set  of  literature  working  on  georeferenced  annotated  images,  which 
allows  us  to  explore  these  questions.  As  argued  above,  our  focus  goes  beyond 
purely  spatial  analysis,  and  in  particular  focuses  on  textual  annotations.  In  fact, 
many  of  the  methods  applied  come  from  the  domains  of  statistical  natural  lan- 
guage processing  and  information  retrieval  and  focus  on  extracting  informa- 
tion from  a corpus  (Manning  & Schiitze  1999).  Common  to  all  corpora  are  the 
basic  notions  of  documents  (in  our  case  represented  by  annotations  related 
to  an  individual  photograph).  Information  about  authorship  (in  our  case  in 
the  form  of  unique  users)  is  somewhat  less  common,  and  explicit  links  to  spa- 
tial locations  are  what  make  our  collections  of  georeferenced  photographs 
particularly  interesting.  Thus,  in  the  following,  we  will  firstly  introduce  some 
global  analysis  methods  - and  ignore  potential  stratifications  of  the  data  by 
user  or  location  (Questions  1 & 2).  We  will  then  discuss  the  link  between  user 
behaviour  and  language  (Question  3)  before  finally  looking  at  the  explicit  link 
between  language,  location  and  scale  (Questions  4 & 5). 

In  this  paper  we  use  as  exemplary  data  two  examples  of  UGC:  Geograph 
and  Flickr.  Our  analyses  are  based  on  previous  work  reported  in  Purves  et  al. 
(2011).  We  focus  on  two  forms  of  text  input  associated  with  georeferenced 
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images  in  the  British  Isles:  firstly  short  descriptive  texts  from  Geograph,  and 
secondly,  tags  associated  with  images  in  Flickr. 


Global  analysis  methods 

The  first  question  that  we  can  ask  of  any  corpus  concerns  its  composition.  These 
are  simple  questions  of  frequency  - what  words  occur  and  how  often,  and  how 
are  frequencies  distributed  in  a corpus.  A second,  often  neglected  question  is  to 
ask,  are  the  answers  to  the  former  in  any  way  surprising? 

For  narrative  text,  function  words  (prepositions,  conjunctions,  pronouns) 
will  typically  be  most  frequent  in  any  corpus  and  only  by  filtering  out  such 
terms  (often  called  stop  words)  or  exploring  specific  parts  of  speech  (for 
example  the  use  of  nouns  and  proper  nouns)  can  peculiarities  of  a collec- 
tion with  respect  to  general  language  be  explored  (Manning  & Schiitze  1999; 
Purves  et  al.  2011)  (Table  1).  Word  frequencies  in  a corpus  typically  broadly 
follow  Zipf ’s  law  - frequency  is  inversely  proportional  to  rank.  This  implies 
in  turn  that  a small  number  of  different  words  account  for  a large  proportion 
of  the  total  word  count  in  any  corpus,  and  many  words  occur  rarely  in  a given 
corpus. 

It  is  important  to  note  that  tag  lists  are  typically  shorn  of  much  the  accou- 
trement of  narrative  text,  and  consist  of  relatively  informative,  freestanding 
terms  (O’Hare  & Murdock,  2012;  Purves  et  al.  2011;  Rattenbury  & Naaman 
2009).  Thus,  frequency  counts  of  tags  may  already  be  informative  with  respect 
to  semantic  content,  with  for  example  around  80%  of  the  Flickr  tags  analysed 
by  Purves  et  al.  (2011)  taking  the  form  of  generic  nouns  (e.g.  church4,  hill, 
wedding)  or  proper  nouns  (e.g.  tom,  monday,  nikon,  edinburgh)  (Table  1). 
Hollenstein  and  Purves  (2010)  reported  an  average  of  25%  of  tags  as  referring 
to  locations  and  Rattenbury  and  Naaman  (2009)  identified  some  12-16%  of 
tags  as  being  place  tags’.  Place  tags  still  typically  show  Zipfian  distributions. 

In  the  above  we  effectively  ignore  the  semantics  or  meaning  of  individual 
terms  or  tags.  Thus,  forest  and  woods  are  treated  as  entirely  independent  terms, 
as  are  New  York  and  Big  Apple,  despite  their  obvious  overlapping  meanings. 
The  first  step  in  dealing  with  this  problem  is  tokenisation  - that  is  parsing  some 
given  input  text  to  a set  of  meaningful  units.  This,  at  first  glance,  trivial  prob- 
lem is  anything  but.  Approaches  to  tokenisation  can  have  significant  impacts 
on  results  (for  example,  is  New  York  one  token  or  two?)  (Manning  & Schiitze 
1999:  chap.  4).  The  second  step  typically  involves  the  use  of  more  advanced 
methods  such  as  lemmatisation  and  tagging  of  parts  of  speech,  which  fall  firmly 
into  the  domain  of  natural  language  processing.  Once  again,  the  popularity  of 
tags  can  be  attributed  to  their  simple  structure,  but  it  is  important  to  note  that 


4 We  refer  to  tags  in  the  text  thus:  tag 
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Geograph 
(Top  10) 

Geograph  (Top 

10  nouns) 

Flickr  (Top  10) 

Rank 

Count 

Word 

Rank 

Count 

Word 

Rank 

Count 

Tag 

1 

426936 

the 

13 

45768 

road 

1 

187605 

london 

2 

275878 

of 

21 

24085 

view 

2 

97696 

england 

3 

189089 

to 

24 

21119 

farm 

3 

96622 

uk 

4 

184705 

a 

32 

17242 

lane 

4 

40528 

2007 

5 

179553 

in 

36 

16232 

hill 

5 

34032 

Scotland 

6 

171429 

and 

37 

16157 

church 

6 

29654 

unitedkingdom 

7 

153707 

on 

38 

15815 

bridge 

7 

24525 

2006 

8 

152091 

is 

43 

14737 

river 

8 

21535 

edinburgh 

9 

141579 

from 

45 

14150 

square 

9 

20215 

ireland 

10 

132451 

this 

48 

13690 

house 

10 

17596 

dublin 

Table  1:  Most  frequent  terms  from  narratives  of  912874  Geograph  photographs 
and  tags  of  759638  Flickr  photographs  for  data  collected  in  a bounding  box 
corresponding  to  the  British  Isles  in  April  2008  (more  details  in  Purves  et  al. 
(2011)). 


this  does  not  remove  problems  of,  for  example,  ambiguity  (e.g.  does  the  tag 
bath  refer  to  a town  in  England  or  a place  to  wash  oneself?). 

One  approach  taken  to  explore  in  more  detail  how  words  or  tags  are  semanti- 
cally related  to  one  another  is  the  use  of  co-occurrence  to  identify  meaningful 
collocations  - an  expression  consisting  of  two  or  more  words  that  correspond 
to  some  conventional  way  of  saying  things’”  (Manning  & Schiitze  1999:  151). 
The  key  task  here  is  to  disentangle  expressions  which  co-occur  by  chance  from 
those  whose  co-occurrence  is  statistically  and  semantically  meaningful. 

A surprisingly  effective  and  efficient  approach  to  this  is  adding  some  form 
of  structure  to  words  or  tags  found  in  a collection  through  annotation.  Such 
annotation  tasks  often  take  the  form  of  the  formulation  of  a set  of  rules,  applied 
independently  by  a group  of  annotators,  in  which  final  decisions  about  class 
membership  is  based  on  some  majority  decision  (e.g.  Purves  et  al.  2011;  Rat- 
tenbury  & Naaman  2009).  Thus,  for  example,  Purves  et  al.  (2011)  generated 
a simple  taxonomy  classifying  words  or  tags  as  elements  (things  that  are  vis- 
ible in  an  image),  qualities  (properties  which  might  modify  an  element  or 
suggest  feelings  or  moods)  and  activities.  Using  this  taxonomy  it  was  then  pos- 
sible to  explore  co-occurrence,  and  identify  both  meaningful  collocations  or 
co-occurrences  (e.g.  steep  hill  or  city  park).  Annotation  tasks  such  as  those 
described  here  can  be  seen  as  substituting  specialised  task-defined  term  dic- 
tionaries for  more  commonly  available,  but  less  specific,  semantic  resources 
such  as  WordNet  (Miller  1995). 
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User  behaviour  and  language 

Other  chapters  in  this  book  concern  themselves  with  issues  of  participation 
inequality  - the  basic  notion  that  a small  number  of  users  contribute  much  of 
the  content  to  most  examples  of  user  generated  content.  The  importance  of  this 
observation  in  analysis  of  georeferenced  annotated  photographs  is  straightfor- 
ward - are  we  analysing  the  way  in  which  many  people  have  described  a par- 
ticular type  of  photograph  (and  their  locations)  - or  the  behaviour  of  only  a 
few?  Thus,  for  example,  tags  describing  trucks  and  lorries  were  the  21st  and 
22nd  most  frequent  in  a collection  of  450,272  photographs  contributed  by  a 
total  of  12,682  users,  but  only  used  by  15  and  7 users  respectively  By  contrast, 
the  most  frequent  tag,  edinburgh,  was  used  by  a total  of  7,427  users,  and  the 
20  most  frequent  tags  were  all  used  by  more  than  300  users.  However,  simply 
being  used  rarely  does  not  per  se  indicate  that  a tag  is  not  meaningful.  In  this 
particular  case  trucks  and  lorries  are  presumably  the  subject  of  interest  of  a 
small  group,  but  this  does  not  mean  that  the  locations  where  they  were  photo- 
graphed are  unrepresentative.  Considering  the  influence  of  individual  users  on 
tag  semantics  is  therefore  an  important,  and  ongoing  research  challenge,  in  the 
analysis  of  annotated  georeferenced  photographs. 

Purves  et  al.  (2011)  explored  tagging  behaviour  by  binning  all  photographs 
contributed  to  a collection,  sorted  by  user  prolificness.  Histrograms  of  indi- 
vidual tag  usage  then  showed  the  proportion  of  tags  contributed  by  more  or 
less  prolific  users,  along  with  z- scores  provided  a summative  value  indicating 
whether  a tag  was  used  in  similar  ways  by  all  contributors  to  a collection.  This 
approach  has  the  advantage  of  allowing  exploration  of  individual  tags,  rather 
than  contributions,  and  their  influence  through  user  behaviour.  Furthermore, 
it  provides  a way  of  dealing  with  bias  caused  by,  for  example,  bulk  uploads,  at 
the  level  of  individual  tags,  rather  than  users. 


Language,  location  and  scale 

In  a book  on  Volunteered  Geographic  Information  it  is  of  course  the  loca- 
tion of  information  which  is  of  primary  interest.  Georeferenced  images  were 
adopted  very  rapidly  by  researchers  in  this  area  because  not  only  were  locations 
explicitly  recorded,  but  the  assumption  that  the  content  was  linked  to  a loca- 
tion is  more  immediate  and  seems  more  realistic  in  describing  images  taken 
somewhere.  However,  issues  of  granularity  quickly  become  apparent,  with  for 
example  the  most  frequent  three  tags  in  a collection  of  1,520,212  images  cap- 
tured within  the  bounding  box  of  Scotland  being  Scotland,  edinburgh,  and 
glasgow  respectively  (Figure  1).  Clearly  Scotland  is  not  wrong,  but  neither  is 
it  informative.  This  problem  is  identical  to  that  illustrated  by  the  top  ten  words 
from  Geograph  in  Table  1 - the  is  indeed  a very  frequent  word,  but  it  isn’t  ter- 
ribly interesting! 
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One  approach  to  identifying  more  interesting  terms  is  to  home  in  on  those 
which  more  effectively  characterise  a document  by  comparing  frequency  of  a 
chosen  term  in  a given  document  to  frequency  across  a corpus  as  a whole.  This 
approach  is  known  as  term  frequency-  inverse  document  frequency  (tf-idf) 
and  is  a baseline  ranking  method  in  information  retrieval.  It  can  be  applied  in 
a geographical  context  by  counting  the  number  of  images  with  a particular  tag 
within  a prescribed  region  (or  cell)  and  comparing  this  with  frequency  over  a 
larger  geographic  region  (Ahern  et  al.  2007;  Rattenbury  & Naaman  2009).  The 
basic  effect  of  geographical  applications  of  tf-idf  is  to  privilege  locally  com- 
mon, but  globally  rare  tags  over  globally  common  tags.  Recognising  the  nature 
of  user  generated  content  and  the  issues  relating  to  user  behaviour  described 
above,  many  researchers  have  added  a term  to  capture  user  frequency  in  this 
characterisation,  typically  ranking  tags  used  by  many  higher  within  in  a region 
(Ahern  et  al.  2007;  Feick  & Robertson  2014;  O’Hare  & Murdock  2012;  Ratten- 
bury & Naaman  2009). 

Obviously  the  size  and  form  of  the  regions  within  which  frequencies  are  cal- 
culated will  have  an  influence  on  the  results.  The  former  property,  size  effec- 
tively captures  notions  of  scale,  while  the  latter,  form,  is  closely  related  to  the 
classical  Modifiable  Areal  Unit  Problem  (MAUP).  To  capture  notions  of  scale 
it  is  important  to  characterise  tag  semantics  at  multiple  scales  (c.f.  Ahern  et  al. 
2007;  Feick  & Robertson  2014;  Rattenbury  & Naaman  2009).  Dealing  with 
MAUP  has  led  to  a number  of  approaches.  Rattenbury  and  Naaman  (2009)  and 
Ahern  et  al.  (2007)  generated  regions  bottom  up,  by  clustering  on  photograph 
positions  themselves  using  K-means.  Feick  and  Robertson  (2014)  imposed  a 
multi-scale  hexagonal  tessellation,  which  they  is  argued  is  better  able  to  capture 
the  complex  geometries  of  real  world  regions.  They  explored  similarity  between 
tag  characterisation  of  connected  hexagons  to  identify  larger  semantic  regions. 

Figure  1 illustrates  some  of  these  notions  for  a dataset  consisting  of  1,520,212 
photographs,  containing  a total  of  53,842  unique  tags  and  captured  by  31,292 
unique  users.  The  ten  most  common  tags  were:  Scotland,  edinburgh,  glas- 
gow,  uk,  united  kingdom,  geotagged,  england,  music,  uploaded:by=flickr_ 
mobile,  and  highlands.  Seven  of  these  are  toponyms,  but  contain  little  or  no 
useful  information  (the  images  were  all  from  within  Scotland’s  bounding  box, 
and  Edinburgh  and  Glasgow  are  simply  the  two  most  populous  cities).  Two 
(geotagged  and  uploaded:by=flickr_mobile)  refer  to  properties  of  the  data 
which  are  self-evident  in  the  first  case  and  refer  to  an  application  used  to  deliver 
data  in  the  second.  Finally,  music  reflects  Flickr’s  popularity  as  a platform  for 
describing  leisure  activities  (Antoniou  et  al.  2010).  Figure  1 ranks  tags  using 
three  methods  discussed  above  for  a square  grid.  Firstly,  tags  are  ranked  using 
only  frequency  and,  as  was  the  case  in  Table  1,  simply  reflect  characteristics  of 
the  collection  as  a whole  (note  the  predominance  of  Scotland).  Secondly,  tf-idf, 
filtered  for  multiple  users  gives  back  a much  more  local  picture,  and  is  domi- 
nated by  more  local  toponyms,  with  the  exception  of  larger  cities,  where  activi- 
ties and  their  locations  (e.g.  fringe  festival,  murrayfield  (rugby),  and  bongo 
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Figure  1:  Top  ranked  tags  for  Flickr  images  in  Scotland’s  bounding  box  according  to  term  frequency,  TF-IDF  and  TF-IDF  filtered 
using  list  of  elements  and  qualities  according  to  [ref].  Only  terms  used  by  a minimum  of  3 users  are  shown.  Regions  are  defined 
as  1 degree  x 1 degree. 

Inset  map  shows  top  three  terms  calculated  by  TF-IDF  for  region  around  Edinburgh  (Region  defined  as  0.2  degree  x 0.2  degree). 
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club  (Nightclub,  gig,  and  events  venue)  for  Edinburgh)  become  visible.  Zoom- 
ing in  to  a more  detailed  grid  using  tf-idf  reveals  finer  granularity  toponyms.  To 
start  to  explore  not  only  the  names  of  the  locations  in  grid  cells,  but  what  sorts 
of  places  these  might  be,  tags  are  filtered  according  to  a structured  list  from 
Purves  et  al.  (2011).  The  resulting  tf-idf  values  show  locations  associated  with, 
for  example,  outdoor  activities  (rural,  wild,  hill)  or  more  urban  locations  and 
activities  (stadium,  allotment,  flat). 

The  techniques  described  so  far  focus  on  tags  independent  of  one  another. 
But,  as  discussed  above  co-occurrence  can  reveal  more  semantically  rich  infor- 
mation (e.g.  castle  ruin  or  tall  building)  and  by  using  (most  profitably)  inter- 
active visualisations  such  co-occurrence  can  be  geographically  located  (Dykes 


Figure  2:  Top  ten  Geograph  terms  describing  elements  and  their  co-occurrence 
with  one  another  presented  as  a spatial  treemap  Dykes  & Wood  (2009).  The 
size  of  a rectangle  indicates  the  overall  count  of  co-occurrences  for  a particu- 
lar term,  while  the  nested  rectangles  indicate  the  relative  predominance  of 
individual  collocates,  and  the  colours  link  these  to  location  - thus,  for  exam- 
ple, the  most  common  terms  used  with  farm  are  hill,  house,  lane  and  road. 
Figure  adapted  from  data  published  in  Purves,  Edwardes  & Wood  (2011). 
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and  Wood  2009),  for  example  by  using  spatial  treemaps  (Dykes  & Wood  2009). 
Spatial  treemaps  are  hierarchical  structures  which  can  show  1)  the  overall 
occurrence  of  an  individual  term,  2)  the  most  commonly  co-occurring  terms 
associated  with  each  term,  and,  when  linked  to  a key  using  colour  3)  the  loca- 
tions of  the  co-occurrences.  Figure  2 shows  co-occurrences  of  the  top  ten  most 
frequent  elements,  together  with  a colour  legend  linking  the  distribution  for 
co-occurring  terms  across  the  British  Isles  in  the  Geograph  dataset.  Such  visu- 
alisations allow  us  to  start  to  explore  the  link  between  particular  sorts  of  loca- 
tions and  their  properties,  for  example  the  relative  importance  of  river  and 
road  with  respect  to  bridge  compared  to  the  importance  of  land,  house,  road, 
and  hill  with  respect  to  farm. 


Recommendations 

The  motivations  for  analysing  geotagged  imagery  are  as  varied  as  its  contribu- 
tors. Thus  the  challenge  lies  not  in  the  analysis  per  se,  but  in  the  initial  process- 
ing of  the  data  and  in  the  interpretation  of  the  results.  Consensus  need  not 
be  a prerequisite  in  extracting  semantics;  just  because  a prolific  user  contrib- 
utes images  of  a highly  thematic  form  does  not  make  that  contribution  biased. 
However,  some  basic  understanding  of  what  properties  in  a collection  might 
be  surprising  and  a related  awareness  for  the  spectrum  of  existing  approaches 
are  both  indispensable.  In  this  short  chapter  we  have  scratched  the  surface  of 
available  methods  - however  we  hope  this  material  and  the  related  references 
will  prove  a useful  starting  point  for  researchers  new  to  the  area. 

Of  course,  the  astute  reader  is  still  waiting  for  a silver  bullet  - but  the  reality  is 
that  all  techniques  should  be  seen  as  exploratory,  and  that  great  care  is  required 
in  the  interpretation  of  these  qualitative  outputs.  Nonetheless,  we  recommend 
the  following  basic  considerations,  which  we  link  here  to  the  questions  set  out 
in  the  introduction: 

• Global  views  on  datasets  allow  an  initial  quick  view  of  datasets  (Ql) 

• Consideration  of  the  meaning  of  tags,  and  an  understanding  of  potential 
ambiguities  can  be  aided  by  simple  methods  such  as  co-occurrence  (Q2) 

• User  behaviours  can  lead  to  significant  biases,  for  example  through  bulk 
uploads  and  users  with  particular  thematic  interests  (Q3) 

• Purely  frequency-based  methods  are  unlikely  to  reveal  interesting  spatial 
patterns  - however,  simple  methods  such  as  tf-idf  can  rapidly  increase  the 
amount  of  information  available  in  collection  (Q4) 

• When  analysing  geographic  data  basic  notions  such  as  scale  and  MAUP 
cannot  be  forgotten  (Q5) 

• Novel  visualisation  techniques  can  provide  useful  insights  and  lead  to  the 
generation  of  new  hypotheses  (Q5) 
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Abstract 

Analysis  of  the  collections  of  geographically  referenced  posts  published  in 
social  media,  such  as  Twitter,  Flickr,  and  YouTube,  can  bring  new  knowledge 
about  places,  geographical  objects,  and  events  interesting  to  people,  and  about 
peoples  mobility  behaviours.  Gaining  knowledge  from  large  data  collections 
requires  combining  computational  analysis  with  human  interpretation,  judge- 
ment, and  reasoning,  which,  in  turn,  require  appropriate  visual  representa- 
tions of  the  data  and  analysis  results.  Visual  analytics  integrates  computational 
analysis  techniques  with  interactive  visual  interfaces  to  support  collaborative 
human-computer  analytical  activities.  We  give  a brief  overview  of  visual  ana- 
lytics approaches  to  extracting  various  kinds  of  information  and  knowledge 
from  georeferenced  social  media  data. 
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Introduction 

Microblogging  services,  such  as  Twitter,  and  services  for  sharing  photo  and 
video,  such  as  Flickr  and  YouTube,  allow  the  users  to  supply  their  posts  with 
geographic  coordinates.  The  high  popularity  of  these  services  in  conjunction 
with  the  widespread  proliferation  of  devices  capable  of  providing  location 
information  has  led  to  great  and  constantly  increasing  volumes  of  location-  and 
time-referenced  data  produced  by  myriads  of  users.  By  analysing  these  data, 
it  is  possible  to  extract  interesting  new  information  about  various  places  and 
events  as  well  as  about  peoples  interests,  mobility  behaviours,  and  life  styles. 

Analysis  of  social  media  data  is  currently  a popular  topic  in  visual  analytics,  a 
research  discipline  that  aims  to  support  synergistic  human-computer  analyti- 
cal workflows  by  combining  computational  analysis  techniques  with  interac- 
tive visual  interfaces  supporting  human  interpretation,  judgement,  and  reason- 
ing (Keim  et  al.  2010).  We  give  a brief  overview  of  the  published  literature  that 
describes  visual  analytics  approaches  to  extracting  different  kinds  of  informa- 
tion from  georeferenced  social  media  data.  Most  of  the  works  do  not  focus  on 
extracting  a single  type  of  information  but  deal  with  several  types. 


Analysis  of  georeferenced  photo  data 

The  photos  published  at  Flickr,  Panoramio,  and  other  photo  sharing  services 
are  supplied  with  metadata,  which  include  the  dates  and  times  of  the  shots 
and  may  also  include  titles  and/or  text  tags  indicating  the  contents  of  the  pho- 
tos. For  many  photos,  the  metadata  include  the  coordinates  of  the  locations 
where  the  photos  had  been  taken.  Collections  of  metadata  records  including 
geographic  coordinates  were  analysed  in  multiple  ways  according  to  the  pos- 
sible analysis  foci  (space  and  place  or  people)  and  respective  tasks  (Andrienko 
et  al.  2009).  The  photo  data  were  considered  from  two  distinct  perspectives: 
as  spatial  events  (independent  points  in  space  and  time)  and  as  trajectories  of 
people  (i.e.  of  the  photo  authors). 


Analysing  photo  taking  events 

In  analysing  the  data  as  spatial  events,  spatial  density-based  clustering  was 
used  for  identifying  popular  places  attracting  much  attention  of  the  photo 
authors.  Visualisation  of  the  times  when  the  photos  had  been  taken  in  these 
places  revealed  different  seasonal  patterns  of  the  place  visits.  To  study  the 
spatial  distribution  of  the  photos  over  a territory  and  compare  the  temporal 
patterns  of  visiting  different  parts  of  it,  the  territory  is  divided  into  compart- 
ments, e.g.  by  a regular  (Andrienko  et  al.  2009)  or  irregular  (Jankowski  et  al. 
2010;  Andrienko  et  al.  2012)  grid,  and  the  photo  taking  events  are  aggregated 
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by  these  compartments  and  time  intervals.  The  resulting  time  series  of  the 
event  counts  are  visualised  on  a map  (Andrienko  et  al  2009)  or  on  a time  graph 
(Jankowski  et  al.  2010;  Andrienko  et  al.  2012),  which  is  linked  to  a map  display 
through  interactive  techniques,  including  synchronous  highlighting,  selection, 
and  filtering  of  corresponding  visual  objects.  By  analysing  the  time  series  using 
either  mostly  interactive  (Jankowski  et  al.  2010)  or  computationally  supported 
(Andrienko  et  al.  2012)  techniques,  the  researchers  detected  places  with  inter- 
esting temporal  patterns  of  visits,  such  as  periodic  peaks  at  particular  times 
of  the  year,  very  high  irregularly  occurring  peaks,  and  significant  increase  of 
place  popularity  starting  from  a particular  time.  To  understand  the  reasons  for 
these  patterns,  the  researchers  extracted  frequently  occurring  words  and  word 
combinations  from  the  titles  of  the  photos  that  had  been  taken  in  the  places 
and  times  of  the  peaks  or  sudden  increases  of  attendance.  In  most  cases,  the 
extracted  words  referred  to  various  public  events  (festivals,  open-air  shows  and 
concerts,  etc.),  but  also  to  interesting  natural  phenomena,  such  as  cherry  tree 
blossoming  or  abundant  snowfalls.  A different  approach  to  identifying  public 
events  and  other  happenings  attracting  peoples  attention  is  by  using  spatio- 
temporal  clustering  of  the  photo  taking  events  (section  6.2.3  of  Andrienko  et  al. 
2013a)  which  finds  occurrences  of  multiple  photos  taken  closely  in  space  and 
time,  i.e.  spatio-temporal  clusters.  For  the  clusters,  frequently  occurring  words 
and  word  combinations  are  extracted  and  investigated  using  a text  cloud  dis- 
play linked  to  a map  (Figure  1). 

Sections  7.2.1 -7.2.5  of  the  book  Andrienko  et  al.  (2013a)  present  an  example 
of  an  in-depth  analysis  of  time  series  of  the  presence  of  distinct  photographers 
by  regions  of  Switzerland.  The  analysis  includes,  among  other  techniques,  visu- 
ally supported  clustering  of  the  time  series  and  interactive  generation  of  models 
for  predicting  the  number  of  photographers  that  can  be  expected  to  visit  the 
regions  in  the  future  at  different  times  of  a year.  The  time  series  can  also  be 
viewed  from  a different  perspective:  as  a sequence  of  spatial  distributions  of  the 
photographers  presence  in  different  time  intervals.  To  study  the  temporal  pat- 
terns of  the  occurrence  of  similar  and  dissimilar  spatial  distribution  patterns,  the 
distributions  are  clustered  by  similarity,  summarized  by  the  resulting  clusters, 
and  compared  using  multiple  map  displays  and  special  interactive  operations 
supporting  comparisons  (section  8.1.1  of  Anrienko  et  al.  2013a).  The  tempo- 
ral distribution  of  the  clusters  is  visually  represented  on  temporal  displays.  The 
provided  example  demonstrates  how  the  analysis  reveals  an  interaction  between 
temporal  periodicity  and  temporal  trends  in  the  sequence  of  the  spatial  distribu- 
tions of  the  presence  of  Flickr  photographers  over  the  territory  of  Switzerland. 


Analysing  trajectories  of  photo  authors 

Trajectories  of  people  can  be  constructed  from  georeferenced  photo  data  by 
arranging  the  records  of  each  individual  photographer  in  a chronological 
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Figure  1:  Top:  the  frequent  occurrences  of  words  and  combinations  in  the 
photo  titles  within  spatio-temporal  clusters  of  Flickr  photos  are  represented 
on  a map  by  point  symbols  coloured  according  to  the  spatial  positions  of  the 
clusters.  Bottom:  the  words  and  combinations  are  represented  in  a text  cloud 
display,  the  font  sizes  being  proportional  to  the  frequencies  and  the  colours 
corresponding  to  the  spatial  locations,  as  in  the  map.  One  of  the  word  com- 
binations (‘Interlaken  red  bull  air  race ) is  selected  in  the  text  cloud  view  by 
mouse-pointing;  the  corresponding  point  is  highlighted  on  the  map  (marked 
with  an  arrow).  Source:  Andrienko  et  al.  2013a. 

sequence  (the  same  idea  applies  to  any  kind  of  georeferenced  data  that  include 
identifiers  of  individuals,  in  particular,  to  data  from  YouTube,  Twitter,  and 
other  social  media).  Trajectories  of  individuals  can  be  aggregated  into  flows 
between  compartments  of  a territory  division  and  visualised  on  flow  maps 
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to  enable  studying  of  mass  movement  patterns  (Andrienko  et  al.  2009).  The 
aggregation  of  trajectories  into  flows  can  be  done  by  time  intervals  for  study- 
ing seasonal  differences  between  the  mass  movement  patterns  (Jankowski  et 
al.  2010).  A set  of  trajectories  can  also  be  analysed  for  discovering  frequent 
sequences  of  place  visits  (section  7.3.4  of  Andrienko  et  al.  2013a).  The  extracted 
frequent  sequences  can  be  explored  using  a text  cloud  display  combined  with 
an  interactive  map  and  a space-time  cube.  By  analysing  peoples  trajectories, 
one  can  also  detect  meetings  of  two  or  more  individuals,  including  repeated 
meetings  of  the  same  pairs  or  groups  of  individuals,  and  joint  trips  of  two 
or  more  photographers  (Andrienko  et  al.  2009);  however,  performing  such 
analyses  may  be  unethical,  as  they  may  compromise  the  personal  privacy  of 
the  individuals. 

This  overview  gives  an  idea  about  the  diversity  of  the  possible  approaches  to 
analysing  georeferenced  photo  data  and  the  kinds  of  information  and  knowl- 
edge that  can  be  extracted  from  such  data.  The  same  range  of  approaches  is 
also  applicable  to  georeferenced  microblogging  data,  such  as  data  from  Twit- 
ter. The  types  of  information  that  can  be  extracted  from  the  two  different 
sources  of  data  are  the  same  but  the  interpretation  may  be  different.  Thus, 
people  mostly  take  photos  when  they  encounter  interesting  places,  objects,  or 
events;  besides,  not  all  taken  photos  but  only  the  best  or  the  most  interesting 
ones  may  be  published.  It  should  also  be  taken  into  account  that  photos  are 
rarely  taken  in  low  light  conditions,  and  that  there  are  situations  and  places  in 
which  taking  photos  is  prohibited.  Therefore,  the  photo  data  cannot  be  con- 
sidered representative  of  peoples  presence  and  movements  over  a territory 
and  of  peoples  everyday  activities.  Figure  1 shows  that  photo  data  may  reflect 
peoples  leisure  activities  and  touristic  travels.  However,  it  would  be  wrong 
to  assume  that  this  is  always  the  case.  The  possible  relation  of  the  published 
photos  to  the  authors  leisure  time,  travels,  or  professional  activities  can  be 
judged  from  the  temporal  frequency  and  regularity  of  the  photos  and  from 
their  spatial  distribution. 


Analysis  of  georeferenced  microblog  data 

Posting  microblog  messages  from  mobile  devices  may  occur  more  frequently 
and  spontaneously  and  in  a wider  range  of  places  and  situations  than  tak- 
ing and  publishing  photos.  Besides,  there  is  no  time  gap  between  producing 
and  publishing  a message,  while  photo  authors  may  not  publish  their  photos 
immediately  after  taking  but  may  do  this  after  some  (often  quite  long)  time. 
Therefore,  unlike  photos,  microblog  data  are  suitable  for  real  time  analysis, 
which  may  discover  information  about  currently  happening  events,  in  par- 
ticular, abnormal  and  disastrous  events,  such  as  earthquakes  or  storms  (Chae 
et  al.  2012;  Andrienko  et  al.  2014).  This  requires  processing  of  the  message 
texts.  One  of  the  approaches  is  pre-filtering  of  the  messages  for  selecting 
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only  those  that  contain  analysis -relevant  keywords,  such  as  terms  denot- 
ing extreme  weather  conditions  (Andrienko  et  al.  2014).  Another  approach 
is  extracting  significant  terms,  i.e.  such  words  that  do  not  occur  frequently 
in  microblog  messages  in  general  or  in  the  times  (seasons)  or  places  where 
they  have  occurred  (Chae  et  al.  2012;  Bosch  et  al.  2013).  Each  occurrence 
of  a significant  term  is  treated  as  a separate  spatial  event.  A spatio-temporal 
concentration  (cluster)  of  events  with  the  same  term  may  indicate  that  some- 
thing is  happening  in  this  place  and  time,  and  the  term  gives  an  idea  of  what 
may  be  happening.  The  significant  terms  from  such  spatio-temporal  clusters 
are  shown  on  a map  display  using  the  text  cloud  technique,  with  the  font  size 
being  proportional  to  the  number  of  the  term  occurrences.  The  map  is  con- 
stantly updated  in  real  time  as  new  messages  appear.  By  means  of  an  interac- 
tive tool  called  Content  Lens,  the  user  can  select  a particular  area  and  explore 
in  more  detail  the  term  occurrences  in  this  area.  To  increase  the  relevance  of 
the  information  that  is  shown  to  the  user,  various  user-constructed  filters  can 
be  applied  to  the  data  (Bosch  et  al.  2013).  In  the  other  approach  (Anrienko 
et  al.  2014),  the  message  texts  are  only  used  for  the  selection  of  potentially 
relevant  messages  and  not  used  in  the  further  analysis.  The  work  focuses  on 
real  time  detection  of  spatio-temporal  clusters  of  relevant  events,  taking  into 
account  only  the  event  locations  and  times  but  not  the  texts,  and  on  tracing 
the  cluster  evolution  (growing,  shrinking,  moving,  merging,  and  splitting) 
over  time  (Figure  2).  The  individual  events  making  the  clusters  and  their  mes- 
sage texts  can  be  accessed  on  demand. 

An  example  of  an  offline  investigation  of  microblog  posts  related  to  a dis- 
astrous event  (an  epidemic)  is  presented  in  section  6.3.2  of  Andrienko  et  al. 
(2013a).  Although  it  uses  data  generated  synthetically  (based  on  real  data),  it 
shows  the  principal  possibility  of  using  microblog  data  for  identifying  the  ori- 
gin and  possible  cause  of  an  epidemic,  the  ways  of  disease  propagation,  the 
spatial  spread,  and  the  evolution  over  time. 

However,  detecting  and  investigating  disastrous  or  abnormal  happenings  is 
not  the  only  possible  use  case  for  microblog  data.  Georeferenced  microblog 
posts,  at  least  those  from  active  bloggers,  may  to  some  extent  be  considered 
as  representative  of  the  peoples  daily  lives  and  used  for  studying  peoples 
behaviours.  Thus,  an  analysis  of  a collection  of  Tweets  posted  by  residents  of 
the  Seattle  area  (USA)  revealed  interesting  patterns  of  collective  and  individual 
behaviours  (Andrienko  et  al.  2013b).  For  this  analysis,  the  Tweets  were  classi- 
fied according  to  their  topics,  such  as  family,  work,  education,  food,  sports,  etc. 
based  on  the  occurrences  of  topic- specific  keywords  (for  example,  the  topic 
‘family’  is  associated  with  the  terms  denoting  family  members:  mother,  mom, 
father,  daddy,  and  so  on).  The  researchers  explored  how  much  the  Tweet  topics 
are  related  to  the  locations  from  which  the  Tweets  were  posted  and  to  the  times 
when  this  happened.  For  this  purpose,  they  aggregated  the  Tweets  by  the  topics, 
areas  in  space,  and  time  intervals  and  visually  explored  the  results  using  maps 
and  time  histograms.  It  was  found  that  there  are  areas  where  particular  topics 
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Source:  Andrienko  et  al.  (2014). 


Circle  area  is 
proportional  to  value: 
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of  the  topic  ‘transportation.  Source:  Andrienko  et  al.  (2013b). 
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prevail,  which  may  be  related  to  the  kinds  of  objects  or  facilities  located  in  the 
areas  (e.g.  a university  or  a stadium)  or  to  the  characteristics  of  the  population 
(e.g.  an  international  district;  see  Figure  3,  left).  The  researchers  also  looked  at 
the  spatial  distributions  of  the  different  topics  and  found  that  some  of  them  are 
correlated  with  the  distribution  of  certain  kinds  of  objects  or  facilities.  Thus,  the 
topic  ‘transportation  occurs  along  the  main  transportation  corridors  (Figure  3, 
right).  Regarding  the  temporal  distributions  of  the  Tweet  topics,  the  research- 
ers found  several  very  interesting  patterns  of  when  certain  topics  occupy  the 
peoples’  minds.  Thus,  ‘food’  occurs  more  frequently  during  lunch  and  dinner 
times,  ‘coffee’  during/ after  breakfast  and  over  the  forenoon,  ‘transportation 
during  working  day  rush  hours,  and  ‘sports’  and  ‘alcohol’  in  the  evenings  and 
over  the  weekend. 

Although  the  study  shows  that  the  contents  of  some  microblog  posts  are 
related  to  the  places  the  authors  visit  and/or  the  activities  they  perform,  these 
data  in  general  contain  a large  proportion  of  noise,  which  includes  texts  with 
unidentifiable  topics  and  texts  with  topics  that  are  not  relevant  to  the  places  of 
message  posting  (thus,  a person  may  Tweet  about  work  while  being  at  home 
or  about  food  while  travelling  in  public  transport).  In  fact,  the  proportion  of 
noise  outweighs  the  proportion  of  potentially  relevant  data.  Therefore,  it  makes 
sense  to  analyse  the  topic  distribution  in  space  and  time  at  the  level  of  a large 
population  of  microbloggers,  to  have  a sufficiently  large  amount  of  potentially 
relevant  data  and  to  be  able  to  use  valid  statistical  summaries.  At  the  level  of 
individuals,  the  message  texts  can  hardly  be  indicative  of  the  individuals’  activi- 
ties or  purposes  for  visiting  different  places. 

In  analysing  mobility  behaviours  of  individuals,  it  is  reasonable  to  look  not 
at  the  message  texts  but  at  the  temporal  patterns  of  visiting  different  places 
(Andrienko  et  al.  2015).  Significant  (repeatedly  visited)  personal  places  are 
extracted  from  the  collection  of  posts  of  each  individual  by  spatial  clustering 
of  the  post  locations.  Place  semantics  (i.e.  the  meanings,  purposes  for  visiting, 
or  activities  performed  in  the  places)  can  be  determined  based  on  the  times 
over  the  weekly  cycle  when  the  individuals  were  present  in  the  places.  Thus, 
a place  where  a person  is  present  in  the  evenings  and  nights  of  all  days  can 
be  identified  as  the  person’s  home  place.  However,  separate  consideration  of 
the  data  of  each  individual  is  unfeasible  and  harmful  for  the  personal  privacy. 
The  paper  of  Andrienko  et  al.  (2015)  proposes  a privacy- respecting  approach, 
in  which  data  of  a large  number  of  Twitter  users  are  analysed  all  together 
using  a combination  of  computational  techniques  and  visualisations  present- 
ing the  data  and  analysis  result  in  aggregated  form.  After  extracting  personal 
places  and  identifying  their  meanings  in  this  manner,  the  original  georefer- 
enced data  are  transformed  to  trajectories  in  an  abstract  semantic  space.  The 
semantically  abstracted  data  can  be  further  analysed  without  the  risk  of  re- 
identifying  people  based  on  the  specific  places  they  attend.  The  paper  presents 
an  example  of  analysing  mobility  behaviours  of  Twitter  users  in  the  area  of 
San  Diego  (USA). 
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Conclusion 

To  summarise,  georeferenced  data  from  social  media  can  be  analysed  as 
spatial  events  (i.e.  independent  points  in  space  and  time)  and  as  trajecto- 
ries of  people.  To  analyse  such  data,  visual  analytics  proposes  a number  of 
approaches  combining  computational  techniques  (clustering,  aggregation, 
statistical  summarisation,  pattern  detection,  etc.)  with  interactive  visualisa- 
tions. With  these  approaches,  it  is  possible  to  extract  interesting  information 
and  gain  new  knowledge  about  places,  events,  and  peoples  interests,  behav- 
iours, and  habits.  Metadata  of  the  photos  published  through  photo  sharing 
services  can  reveal  peoples  interests  to  tourist  attractions,  public  events  and 
other  happenings,  or  natural  phenomena  and  patterns  of  touristic  behaviour. 
Georeferenced  microblog  posts  can  be  analysed  in  real  time  for  early  detec- 
tion of  abnormal  or  disastrous  events.  It  may  also  be  useful  to  analyse  the 
evolution  of  such  events  by  looking  at  the  spatio-temporal  distribution  of  the 
event-related  posts.  Besides  the  information  concerning  unusual  happenings, 
microblog  data  may  be  a source  of  knowledge  about  everyday  mobility  and 
activities  of  people.  As  both  the  popularity  of  the  social  media  and  the  interest 
to  analysing  social  media  data  are  growing,  we  can  expect  the  appearance  of 
new  analysis  methods  and  new  use  cases  for  information  that  can  be  extracted 
by  these  methods. 
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Abstract 

The  things  surrounding  us  vary  dramatically,  which  implies  that  there  are  far 
more  small  things  than  large  ones,  e.g.,  far  more  small  cities  than  large  ones 
in  the  world.  This  dramatic  variation  is  often  referred  to  as  fractal  or  scaling. 
To  better  reveal  the  fractal  or  scaling  structure,  a new  classification  scheme, 
namely  head/tail  breaks,  has  been  developed  to  recursively  derive  different 
classes  or  hierarchical  levels.  The  head/tail  breaks  works  as  such:  divide  things 
into  a few  large  ones  in  the  head  (those  above  the  average)  and  many  small 
ones  (those  below  the  average)  in  the  tail,  and  recursively  continue  the  divi- 
sion process  for  the  large  ones  (or  the  head)  until  the  notion  of  far  more  small 
things  than  large  ones  has  been  violated.  This  paper  attempts  to  argue  that 
head/tail  breaks  can  be  a powerful  visualization  tool  for  illustrating  structure 
and  dynamics  of  natural  cities.  Natural  cities  refer  to  naturally  or  objectively 
defined  human  settlements  based  on  a meaningful  cutoff  averaged  from  a mas- 
sive amount  of  units  extracted  from  geographic  information.  To  illustrate  the 
effectiveness  of  head/tail  breaks  in  visualization,  I have  developed  some  case 
studies  applied  to  natural  cities  derived  from  the  points  of  interest,  and  social 
media  location  data.  I further  elaborate  on  head/tail  breaks  related  to  fractals, 
beauty,  and  big  data. 
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Introduction 

The  things  surrounding  us  vary  dramatically,  which  implies  that,  instead  of 
more  or  less  similar  things,  there  are  actually  far  more  small  things  than  large 
ones,  e.g.,  far  more  small  cities  than  large  ones  in  the  world.  This  dramatic  vari- 
ation is  often  referred  to  fractal  or  scaling,  and  is  well  captured  by  geographic 
information  of  various  kinds.  Reflected  in  the  points  of  interest  (POI),  there  are 
far  more  POI  in  cities  than  in  countryside;  in  terms  of  social  media,  there  are  far 
more  users  in  cities  than  in  the  countryside.  The  new  kind  of  geographic  infor- 
mation constitutes  what  we  now  call  big  data  (Mayer-Schonberger  and  Cukier 
2013)  in  contrast  to  conventional  small  data.  Unlike  small  data  (e.g.,  census  or 
statistical  data),  which  are  often  estimated  and  aggregated,  geographic  infor- 
mation in  the  big  data  era  is  accurately  and  precisely  measured  at  an  individual 
level.  This  kind  of  geographic  information  due  to  its  diversity  and  heterogeneity 
is  likely  to  show  the  scaling  pattern  of  far  more  small  things  than  large  ones.  To 
better  reveal  the  scaling  structure,  a new  classification  scheme,  namely  head / 
tail  breaks  (Jiang  2013a),  has  been  developed  to  recursively  derive  inherent 
classes  or  hierarchical  levels.  It  divides  things  around  an  average,  according  to 
their  geometric,  topological  and/or  semantic  properties,  into  a few  large  ones 
in  the  head  (those  above  the  average)  and  many  small  ones  (those  below  the 
average)  in  the  tail,  and  recursively  continues  the  division  process  for  the  large 
ones  (or  the  head)  until  the  notion  of  far  more  small  things  than  large  ones  has 
been  violated  (c.fi,  Section  2 for  a working  example). 

Natural  cities  refer  to  naturally  and  automatically  derived  human  settle- 
ments, or  human  activities  in  general  on  the  earths  surface,  based  on  a mean- 
ingful cutoff  averaged  from  a massive  amount  of  units  extracted  from  massive 
geographic  information.  For  example,  we  build  up  a huge  triangulated  irregular 
network  (TIN  - a digital  data  structure  commonly  used  for  the  representation 
of  a surface)  consisting  of  one-day  Tweets  locations  indicated  by  GPS  coordi- 
nates around  the  world.  It  is  obvious  that  with  the  TIN  there  are  far  more  short 
edges  than  long  ones.  The  average  length  of  the  edges  splits  all  the  edges  into 
two  parts:  a minority  of  long  edges  (longer  than  the  average)  in  the  head,  and 
a majority  of  short  edges  (shorter  than  the  average)  in  the  tail  of  the  rank- size 
plot  (Zipf  1949).  Aggregate  all  short  edges  to  create  thousands  of  natural  cit- 
ies around  the  world.  The  natural  cities  emerge  from  a collective  decision  of 
diverse,  independent,  and  heterogeneous  TIN  edges,  thus  manifesting  some 
wisdom  of  crowds  (Surowiecki  2004).  Interestingly,  the  natural  cities  demon- 
strate striking  fractal  structure  and  nonlinear  dynamics  (Jiang  and  Miao  2015). 
While  conventional  cities  imposed  by  authorities  from  the  top  down  are  of 
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great  use  for  administation  and  mangement,  natural  cities  defined  from  the 
bottom  up  are  of  more  use  for  studying  the  underlying  structure  and  dynamics. 
Natural  cities  are  not  constrained  to  individual  countries,  but  are  universally 
defined  and  delineated  for  the  entire  world  with  support  of  big  data.  Because 
of  the  unversality,  natural  cities  defined  at  very  fine  spatial  and  temporal  scales 
provide  a useful  means  for  scientific  research. 

This  paper  attempts  to  develop  an  argument  that  in  the  big  data  era  head/tail 
breaks  can  become  an  efficient  and  effective  visualization  tool  for  illustrating 
structure  and  dynamics  of  natural  cities.  The  fundamental  logic  of  this  argu- 
ment is  as  such.  A large  number  of  natural  cities  as  a whole  can  be  classified  into 
different  hierarchical  levels  or  classes.  Instead  of  showing  all  the  classes  or  the 
whole,  we  can  deliberately  drop  out  some  low  classes,  yet  without  distorting  the 
underlying  scaling  pattern  of  the  whole.  This  is  because  the  remaining  classes  as 
a sub-whole  are  self-similar  to  the  whole.  This  logic  applies  to  the  time  dimen- 
sion as  well,  i.e.,  instead  of  showing  all  evolving  patterns  along  a time  line,  we 
deliberately  choose  a part  that  reflects  the  whole.  Head/tail  breaks  provides  a 
simple  instrument  that  helps  us  see  fractals  in  nature  and  society,  i.e.,  through 
examining  whether  there  are  far  more  small  things  than  large  ones,  or  more 
precisely  whether  the  scaling  pattern  recurs  multiple  times  with  Ht-index  being 
at  least  3 (Jiang  and  Yin  2014).  Conventionally,  we  must  compute  the  fractal 
dimension  to  determine  whether  a set  or  pattern  is  fractal  (Mandelbrot  1982). 
Fractal  dimension  (D)  is  rigorously  defined,  referring  to  the  ratio  of  the  change 
of  details  (N)  to  that  of  measuring  scale  (r),  D = log(N)/log(r).  Following  the 
rigorous  definition,  fractals  are  found  to  appear  in  a variety  of  phenomena  such 
as  mountains,  trees,  clouds,  rivers,  cities,  streets,  architectures,  the  Internet,  the 
World  Wide  Web,  social  media,  and  even  the  paintings  of  Jackson  Pollock  (e.g., 
Batty  and  Longley  1994,  Eglash  1999,  Taylor  2006).  Now  with  head/tail  breaks, 
not  only  experts,  but  also  the  general  public  can  simply  judge  the  ubiquity  of 
fractals  relying  on  our  intuitions,  i.e.,  a set  or  pattern  is  fractal  if  the  scaling  pat- 
tern of  far  more  small  things  than  large  ones  recurs  multiple  times. 

The  remainder  of  this  paper  is  structured  as  follows.  Section  2 introduces 
head/tail  breaks  and  discusses  how  it  leads  to  a new  definition  of  fractals 
using  the  Sierpinski  carpet  and  Mandelbrot  set  as  working  examples.  Section 
3 reports  several  case  studies  applied  to  visualization  of  natural  cities  derived 
from  POI  and  social  media  data.  Section  4 adds  some  further  discussions  on 
head/ tail  breaks  to  meet  challenges  from  big  data.  Finally  Section  5 concludes 
the  paper,  and  points  to  future  work. 


Head/ tail  breaks  leading  to  a new  definition  of  fractals 

Head/tail  breaks  is  largely  motivated  by  heavy- tailed  distributions  such  as 
power  law,  lognormal,  and  exponential  distributions  (c.f.  Section  4 for  a dis- 
cussion) to  derive  inherent  classes  or  hierarchical  levels.  The  resulting  number 
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of  classes  is  given  by  another  term  called  Ht-index  (Jiang  and  Yin  2014),  as  an 
alternative  index  to  fractal  dimension  for  characterizing  the  complexity  of  frac- 
tals. The  higher  the  Ht-index,  the  more  complex  the  fractals.  Before  illustrating 
its  visualization  capability,  I shall  briefly  introduce  head/tail  breaks  using  the 
working  example  of  the  Sierpinski  carpet. 

The  Sierpinski  carpet,  as  a classic  plane  fractal,  contains  far  more  small  squares 
than  large  ones,  i.e.,  1,  8,  and  64  squares  with  respect  to  sizes  1/3,  1/9  and  1/27 
given  the  carpet  of  one  unit  (Figure  1).  The  fractal  dimension  of  the  Sierpinski 
carpet  can  be  calculated  by  D = log(8)/log(3)  = 1.893,  which  indicates  that  every 
time  the  scale  (r)  is  reduced  three  times,  the  number  of  squares  (N)  increases 
eight  times.  The  calculation  may  look  somewhat  abstract  and  hard  to  grasp. 
Now  let  us  take  a simpler  and  easier  way.  There  are  far  more  small  squares  than 
large  ones;  at  the  smallest  end  there  are  64  squares  sized  1/27,  at  the  largest  end  1 
square  sized  1/3,  and  in  the  middle  of  the  two  ends  8 squares  sized  1/9.  If  we  cre- 
ate a scatterplot  of  these  three  points  in  an  Excel  sheet  and  fit  them  into  a power 
function,  one  would  observe  y = 0.125  x A -1.893  (see  Figure  1).  This  is  called 
the  Richardson  plot,  showing  the  ratio  of  the  change  of  details  (N)  to  the  change 
of  scales  (r).  In  the  Richardson  plot,  three  points  are  exactly  on  the  distribution 
line,  implying  that  the  Sierpinski  carpet  is  strict  fractal,  or  alternatively,  the  parts 
are  strictly  self- similar  to  the  whole.  If  we  replaced  the  squares  with  city  sizes, 
the  points  would  be  around  rather  than  exactly  on  the  distribution  line.  This  is 
because  city  sizes  are  just  statistically  fractal  rather  than  strictly  fractal. 

Now  let  us  examine  how  head/tail  breaks  works  for  the  Sierpinski  carpet.  There 
are  a total  of  1 + 8 + 64  = 73  squares,  and  the  average  size  of  which  is  calculated  by 


Figure  1:  (Color  online)  Illustrstion  of  head/tail  breaks  and  fractal  dimension 
using  the  Sierpinski  carpet. 

Note:  There  are  far  more  small  squares  than  large  ones  for  the  Sierpinski  carpet. 
The  Richardson  plot  shows  the  fractal  dimension,  while  the  nested  rank- size 
plots  demonstrate  the  head/tail  breaks  process  or  the  Ht-index. 
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# Squares 

Mean 

# head 

# tail 

% head 

% tail 

73 

0.0492 

9 

64 

12% 

88% 

9 

0.1358 

1 

8 

11% 

89% 

Table  1:  The  head/tail  breaks  for  the  Sierpinski  squares. 


m1  = (1/3  * 1 + 1/9  * 8 + 1/27  * 64)  / (1  + 8 + 64)  = 0.0492.  This  first  mean  can  split 
all  the  73  squares  into  two  unbalanced  parts:  a small  portion  of  the  large  squares 
(nine  squares  larger  than  the  mean)  in  the  head,  and  a big  portion  of  the  small 
squares  (64  squares  smaller  than  the  mean)  in  the  tail.  For  the  nine  squares  in  the 
head,  their  average  size  is  calculated  by  m2  = (1/3  *1  + 1/9  * 8)  / (1  + 8)  = 0.1358. 
This  second  mean  can  split  all  nine  squares  into  two  unbalanced  parts:  a small 
portion  of  the  large  squares  (one  square  larger  than  the  mean)  in  the  head,  and 
a large  portion  of  the  small  squares  (eight  squares  smaller  than  the  mean)  in  the 
tail  (Table  1).  The  above  calculation  indicates  that  the  pattern  of  far  more  small 
squares  than  large  ones  recurs  twice,  and  therefore  Ht-index  = 2 + 1 = 3.  The 
recurring  scaling  pattern  is  also  shown  in  the  nested  rank-size  plots  in  Figure  1. 
The  Ht-index  is  indeed  3 because  there  are  only  three  scales:  1/3,  1/9,  and  1/27. 
If  we  added  512  squares  of  the  smaller  size  1/81,  the  Ht-index  would  increase  by 
one,  but  the  fractal  dimension  would  remain  unchanged.  From  this,  we  see  how 
Ht-index  complements  fractal  dimension  in  capturing  the  complexity  of  fractals. 

As  the  above  example  shows,  head/tail  breaks  is  quite  simple  and  straight- 
forward, i.e.,  given  that  there  are  far  more  small  things  than  large  ones,  split 
things  into  a few  large  and  many  small,  and  recursively  continue  the  splitting 
for  the  large  until  the  notion  of  far  more  small  things  than  large  ones  is  violated. 
Importantly,  head/tail  breaks  leads  to  a relaxed  definition  of  fractals:  a set  or 
pattern  is  fractal  if  the  notion  of  far  more  small  things  than  large  ones  recurs 
multiple  times,  Ht-index  >=  3.  This  new  definition  based  on  the  head/tail  breaks 
is  pretty  intuitive,  and  may  help  refine  our  eyes  or  improve  our  intuitions  for 
fractals.  As  remarked  by  Mandelbrot  (1982),  the  most  important  instrument  of 
thought  is  the  eye  rather  than  mathematical  formula.  With  the  new  definition, 
anyone  with  little  mathematical  knowledge  can  easily  rely  on  his/her  intuitions 
to  determine  whether  something  is  fractal.  Now  let  us  examine  whether  our 
intuitions  have  been  improved  with  reference  to  the  Mandelbrot  set  in  Figure  2. 

It  is  well  known  that  the  Mandelbrot  set,  probably  the  most  complex  shape 
known  to  man,  comes  from  the  amazingly  simple  equation:  z = z A 2 + c. 
Despite  the  simplicity,  I have  decided  not  to  consider  the  underlying  math- 
ematics; interested  readers  can  refer  to  the  literature  for  more  details  (e.g., 
Mandelbrot  2004).  Instead,  let  us  rely  on  our  intuitions  by  concentrating  on  the 
Mandelbrot  set  shape  and  an  infinite  number  of  convoluted  Julia  sets  shapes  it 
generated,  some  of  which  are  shown  in  Figure  2.  What  the  stunning  images  of 
these  shapes  have  in  common  is  the  ubiquity  of  far  more  small  things  than  large 
ones.  The  Mandelbrot  set  can  be  zoomed  into  deeply  to  find  similar  patterns 


1 74  European  Handbook  of  Crowdsourced  Geographic  Information 


Figure  2:  (Color  online)  Ubiquity  of  far  more  small  things  than  large  ones  in 
both  the  Mandelbrot  set  (Panel  0)  and  the  Julia  sets  (Panels  1 to  6). 

Note:  There  are  an  infinite  number  of  bulbs  tangent  to  the  main  cardioid  of  the 
Mandelbrot  set.  The  Julia  sets  in  Panels  1 and  2 are  generated  from  within  the 
bulbs  (black  in  Panel  0),  whereas  the  Julia  sets  in  Panels  3 to  6 are  generated 
from  outside  of  the  bulbs  (color  in  Panel  0).  The  figures  in  Panel  0 indicate  the 
approximate  locations  where  the  Julia  sets  are  generated. 


again  and  again  infinitely,  so  the  Mandelbrot  set  can  be  said  to  be  “big  data”. 
The  Mandelbrot  set  (Panel  0)  contains  far  more  small  bulbs  than  large  ones,  as 
do  the  two  related  Julia  sets  (Panel  1-2)  generated  from  within  the  bulbs  (black 
in  Panel  0)  of  the  Mandelbrot  set.  The  Julia  sets  generated  from  outside  the 
bulbs  (color  in  Panel  0)  of  the  Mandelbrot  set  have  some  dramatically  different 
shapes  and  colorful  images  (Panel  3-6),  which  clearly  evoke  a sense  or  intuition 
that  there  are  far  more  small  structures  than  large  ones.  The  images  also  look 
beautiful.  Note  that  it  is  essentially  not  the  colors  but  the  underlying  fine  struc- 
tures (or  recurring  pattern  of  far  more  small  things  than  large  ones)  that  make 
the  patterns  beautiful  (Alexander  2002);  see  Section  4 for  a further  discussion. 


Visualization  of  city  structure  and  dynamics 

When  the  social  scientist  Jacob  L.  Moreno  (1934)  first  studied  such  human 
relationships  as  likes  and  dislikes,  his  dream  was  to  map  them  for  a whole  city 
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or  nation.  It  now  appears  that  what  he  dreamed  of  has  been  fully  realized,  not 
only  for  a whole  nation,  but  for  the  entire  world,  with  millions  or  billions  of 
people  connected  through  social  media  such  as  Facebook  and  Twitter.  The 
New  York  Times  praised  Moreno’s  work  as  a new  human  geography  (Jones 
1933),  because  the  map  metaphor  was  used  for  portraying  the  acquired  human 
relationships,  with  nodes  for  individuals,  links  for  relationships  between  the 
individuals,  red  lines  for  liking,  black  lines  for  disliking,  triangles  for  boys, 
and  circles  for  girls.  This  semiology,  together  with  the  methods  of  data  col- 
lection and  data  analysis,  were  typical  social  science  methods  in  the  age  of 
data  scarcity,  or  the  so-called  small  data  era.  Nowadays,  we  have  entered  an 
unprecedented  big  data  era,  in  which  we  are  overwhelmed  by  crowdsourcing 
data,  accumulated  in  social  media  and  contributed  by  individuals  (Goodchild 
2007,  Kwak  et  al.  2010,  Gao  and  Liu  2014).  In  addition,  advanced  geospatial 
technologies  have  already  produced  a large  amount  of  geographic  information 
such  as  satellite  images  (National  Research  Council  2003).  Big  data  requires 
new  ways  of  thinking  (Jiang  2015b)  in  order  to  better  understand  the  underly- 
ing social  and  geographic  structure  and  how  the  structure  evolves  over  time. 
In  this  connection,  visualization  offers  a powerful  means  to  reach  the  better 
understanding. 


Natural  cities  derived  from  POI 

Points  of  interest  (POI)  are  spread  across  countries,  particularly  within  cities, 
represent  interesting  locations  or  facilities  such  as  churches,  schools,  shops, 
and  pubs.  As  a wiki-like  collaboration  to  create  a free  editable  map  of  the  world, 
OpenStreetMap  (Bennett  2010)  has  integrated  millions  of  POI,  including  basic 
categories  such  as  automotive,  eating  and  drinking,  government  and  public 
services,  health  care,  and  leisure.  In  this  study,  I took  approximately  2 mil- 
lion POIs  for  the  three  European  countries:  France,  Germany,  and  the  United 
Kingdom  from  CloudMade  (http://download.cloudmade.com/).  Following  the 
same  procedure  of  extracting  natural  cities  introduced  in  the  previous  work 
(Jiang  and  Miao  2014),  we  built  a huge  TIN  for  each  country,  and  then  derived 
natural  cities  for  further  scaling  analysis.  Table  2 presents  the  basic  statistics 
about  the  derived  natural  cities.  France,  for  example,  has  280,1 17  POI,  of  which 
254,008  unique  points  were  used  to  generate  a huge  TIN  with  835,009  edges. 
There  are  far  more  short  edges  than  long  edges,  so  the  distribution  is  clearly 
L- shaped.  I applied  the  head/ tail  division  rule  (Jiang  and  Liu  2012)  into  the 
massive  number  of  edges,  which  resulted  in  two  unbalanced  parts:  those  above 
the  mean  in  the  head,  and  those  below  the  mean  in  the  tail.  All  those  edges  in 
the  tail  were  aggregated,  leading  to  the  9,391  natural  cities.  Figure  3 shows  the 
resulting  natural  cities  (Panel  1),  together  with  those  for  the  other  two  coun- 
tries (Panels  2 and  3).  Germany  is  the  densest  country  in  terms  of  both  POI  and 
natural  cities,  followed  by  the  UK. 


1 76  European  Handbook  of  Crowdsourced  Geographic  Information 


France 

Germany 

UK 

POI 

280,117 

1,299,638 

505,051 

Unique  POI 

254,008 

977,357 

462,424 

TINEdge 

835,009 

3,238,695 

1,511,023 

Natural  cities 

9,391 

48,830 

16,814 

Ht- index/hierarchy 

6 

7 

6 

Hierarchy  shown 

4 

4 

3 

Table  2:  Basic  statistics  about  the  natural  cities  derived  from  POL 


There  are  far  more  small  natural  cities  than  large  ones  in  terms  of  the  num- 
bers of  POI  they  contain.  To  effectively  visualize  the  underlying  scaling  hier- 
archy of  the  natural  cities,  I applied  the  head/tail  breaks  to  computing  the  Ht- 
index  that  is  shown  in  Table  2.  France,  Germany,  and  the  UK,  respectively,  have 
six,  seven,  and  six  hierarchical  levels  or  classes.  If  all  the  classes  were  displayed 
by  different  sizes  of  red  dots,  no  matter  how  small  they  are,  the  patterns  would 
not  be  recognizable  like  hairballs.  Instead,  I chose  the  top  four  or  three  classes 
(Panels  4,  5,  and  6 of  Figure  3),  which  reflect  the  same  scaling  patterns  of  all 
the  classes  in  the  sense  that  the  pattern  of  far  more  small  things  than  large  ones 
is  retained.  The  fact  that  the  top  classes  reflect  the  whole  is  the  true  power  of 
head/tail  breaks.  In  other  words,  the  top  classes  retain  the  same  scaling  pattern 
of  far  more  small  things  than  large  ones  of  all  the  classes. 


Natural  cities  from  Tweets  locations 

Like  POI,  social  media  users’  locations  can  be  aggregated  to  form  individual 
natural  cities.  Unlike  POI,  Twitter  users’  geolocations  contain  very  precise  time 
information,  up  to  minutes  or  seconds.  In  this  way,  we  can  slice  the  Tweets 
location  data  minute  by  minute,  hour  by  hour,  in  order  to  track  how  the  natural 
cities  evolve.  The  derivation  of  the  natural  cities  followed  the  same  procedure  in 
the  previous  work  (Jiang  and  Miao  2015),  and  was  based  on  the  fact  that  there 
are  far  more  low- density  areas  than  high- density  ones.  That  is,  we  generated  a 
huge  TIN  for  the  unique  locations  of  Tweets  and  then  split  the  TIN  edges  into 
two  unbalanced  parts:  those  above  the  average  in  the  head,  and  those  below 
the  average  in  the  tail.  Eventually,  those  edges  in  the  tail  are  aggregated  into  the 
thousands  of  natural  cities.  The  procedure  is  a simple  application  of  the  head / 
tail  breaks,  or  that  of  the  head/tail  division  rule  (Jiang  and  Liu  2012).  Let  us 
consider  the  four  snapshots  to  examine  the  underlying  fractal  structure  and 
nonlinear  dynamics  of  the  natural  cities  (Figure  4).  The  evolution  of  the  natu- 
ral cities  shows  little  difference  from  that  of  the  Koch  flake:  the  former  being 
statistically  self- similar,  and  the  latter  being  strictly  self- similar.  Accordingly,  I 
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Figure  3:  (Color  online)  The  natural  cities  derived  from  POI  of  France  (Panels  1 
and  4),  Germany  (Panels  2 and  5),  and  the  UK  (Panels  3 and  6). 

Note:  The  red  patches  indicate  the  natural  cities  or  their  boundaries,  whereas 
the  red  dots  indicate  classified  city  sizes  in  terms  of  the  number  of  POI.  As 
mentioned  in  this  paper,  only  a few  top  classes  based  on  the  head/tail  breaks 
are  shown  for  visual  clarity  The  grey  background  is  the  points  of  interest.  The 
map  scales  are  1:15M. 


claimed  that  social  media  could  act  as  a good  proxy  for  studying  the  evolution 
of  real  cities,  in  order  to  understand  how  they  are  generated  and  evolve  through 
local  and  global  interactions  from  the  bottom  up.  This  insight  could  fundamen- 
tally change  the  ways  we  studied  cities  in  the  small  data  era  of  the  past. 

The  scaling  patterns  appear  at  different  levels  of  geographic  space.  This  is 
the  true  sense  of  ubiquity  of  fractal  geographic  features.  It  appears  at  a country 
level,  a regional  level,  and  a city  level.  Figure  5 presents  an  illustration  of  the 
ubiquity  of  scaling  patterns.  The  large  number  of  natural  cities  derived  from 
Tweets  locations  are  classified  into  six  classes.  I display  only  the  top  4 classes 
for  visual  clarity,  yet  they  reflect  the  pattern  of  the  whole  set  (Panel  0).  The 
enlarged  regions  of  Chicago  and  New  York  (Panels  1 and  2)  clearly  show  that 
there  are  far  more  small  cities  than  large  ones.  At  the  city  level,  I computed  the 
connectivity  of  individual  streets,  and  the  connectivity  (or  the  number  of  other 
streets  intersected)  clearly  shows  a heavy-tailed  distribution.  All  the  streets  are 
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2 minutes  5 minutes 


1 hour  12  hours 


Figure  4:  (Color  online)  The  evolution  of  the  natural  cities  on  the  background 
of  TIN  versus  iteration  of  Koch  flake. 

Note:  A few  large  pieces  become  more  fragmented,  whereas  many  small  pieces 
are  continuously  added.  Eventually  there  are  far  more  small  cities  than  large 
ones.  This  way  of  evolution  looks  very  much  like  that  of  Koch  flake.  The  major 
difference  between  the  natural  cities  and  Koch  flake  is  the  former  being  statisti- 
cally self-similar,  and  the  latter  strictly  self-similar.  The  map  scales  are  1:8.4M. 


therefore  classified  and  visualized  based  on  the  head/tail  breaks.  I invite  the 
reader  to  compare  Figures  2 and  5:  the  former  being  purely  mathematical,  and 
the  latter  geographic;  the  former  being  infinite,  and  the  latter  finite;  the  coun- 
try as  a whole  equivalent  to  the  Mandelbrot  set  as  a whole,  whereas  the  cities 
as  parts  equivalent  to  the  Julia  sets  as  parts;  a city  as  a whole  equivalent  to 
the  Mandelbrot  set  a whole,  whereas  the  streets  as  parts  equivalent  to  the  Julia 
sets  as  parts.  From  the  comparison,  we  see  a nested  or  cascading  structure  for 
both  the  Mandelbrot  set  and  the  geographic  space,  and  importantly  the  shared 
recurring  scaling  pattern  of  far  more  small  things  than  large  ones. 


Further  discussions  on  head/tail  breaks 

Head/tail  breaks  offers  a new,  less  strict  way  of  looking  at  our  surrounding  phe- 
nomena, in  particular  societal  and  organization  phenomena.  A phenomenon 
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Figure  5:  (Color  online)  Ubiquity  of  scaling  patterns  at  different  levels  using 
the  USA  (mainland)  as  an  example. 

Note:  The  largest  55  natural  cities  in  the  top  four  classes  at  the  country  level 
(Panel  0);  there  are  far  more  small  cities  than  large  ones  at  the  regional  level 
(Panels  1 and  2),  and  far  more  less-connected  streets  than  well-connected  ones 
at  the  city  level  (Panels  3 to  6),  with  blue  being  the  least  connected  and  red  the 
most  connected.  The  map  scales  for  the  USA  (Panel  0),  the  regions  (Panels  1, 
and  2),  and  the  cities  (Panels  3-6)  are  respectively  1:60M,  1:6M,  and  1:2M. 


or  structure  is  fractal  if  there  are  far  more  small  things  than  large  ones  in  it. 
The  notion  of  far  more  small  things  than  large  ones  is  not  just  in  terms  of  geo- 
metric properties,  but  for  topological  and  semantic  properties  as  well,  i.e.,  far 
more  unpopular  things  than  popular  ones,  or  far  more  meaningless  things  than 
meaningful  ones.  Consequently,  many  societal  and  organizational  phenomena 
are  fractal,  because  they  tend  to  be  divided  in  an  unbalanced  way,  which  is 
known  as  the  80/20  principle  (Koch  1998).  Head/tail  breaks,  in  particular  the 
nested  rank-size  plots,  provides  a new  interpretation  of  self- similarity.  Con- 
ventionally, self- similarity  refers  to  the  property  that  the  whole  has  the  same 
shape  as  one  or  more  of  its  parts  (Mandelbrot  1982).  Now  the  self- similarity 
can  be  interpreted  by  the  repeated  presence  of  far  more  small  things  than  large 
ones,  or  alternatively,  the  repeated  appearance  of  the  small  head  and  long  tail 
division.  It  is  the  self-similarity  that  makes  visualization  of  city  structure  and 
dynamics  possible. 

Head/tail  breaks  applies  to  data  with  a heavy- tailed  distribution.  The  heavy- 
tailed distribution  includes  power  laws  as  well  as  lognormal  and  exponential 
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distributions.  Strictly  speaking,  the  exponential  distribution  is  excluded  from 
the  heavy-tailed  distribution,  because  its  tail  is  quite  short.  I included  it  in  the 
heavy-tailed  distribution  family  because  for  most  real-world  data,  there  is  a 
minimum  threshold  above  which  data  are  claimed  to  be  power  laws,  lognor- 
mal or  exponential  distributions  (Newman  2005).  However,  while  conduct- 
ing the  head/tail  breaks  for  the  data,  we  consider  all  the  data  values  includ- 
ing those  below  the  minimum  threshold.  Thus,  the  data  with  an  exponential 
distribution  including  those  below  the  minimum  would  be  heavy-tailed,  and 
can  therefore  be  broken  multiple  times  using  the  head/tail  breaks.  It  is  also 
widely  recognized  that  the  bigger  the  data,  the  more  likely  they  are  heavy- 
tailed (Jiang  2015d). 

Fractal  structure,  or  the  recurring  scaling  pattern  of  far  more  small  things 
than  large  ones,  possesses  a new  kind  of  beauty  that  positively  impacts 
human  well-being  (Jiang  and  Sui  2014).  The  new  kind  of  beauty,  initially  dis- 
covered and  defined  by  Christopher  Alexander  (2002),  differs  fundamentally 
from  conventional  wisdom  about  aesthetics,  being  personal  and  subjective. 
The  fractal  beauty  exists  in  deep  structure,  being  objective  and  universal  in 
nature.  In  other  words,  it  is  not  the  surface  colors  but  the  deep  fractal  struc- 
ture that  makes  fractals  beautiful.  This  deep  structure  is  a kind  of  order  that 
exists  not  only  in  nature  but  also  in  what  we  build  and  make  (Alexander 
2002),  not  only  in  science  but  also  in  humanities  and  social  sciences.  The 
beauty,  the  order  revealed  by  head/ tail  breaks,  or  fractal  geometry  in  general, 
cuts  across  multiple  sciences  and  disciplines,  bridging  the  two  cultures  (Snow 
1959)  to  form  the  third  culture.  I believe  that  the  visualization  examples  of 
city  structure  and  dynamics  shown  in  the  paper  embody  some  spirits  of  the 
third  culture. 

Many  natural  and  societal  phenomena  demonstrate  fractal  structure  and 
nonlinear  dynamics  (Mandelbrot  and  Hudson  2004).  To  better  understand  the 
complexity  of  social  structure  and  dynamics,  we  must  rely  on  a range  of  com- 
plexity modeling  tools  such  as  fractal  geometry,  chaos  theory,  and  agent-based 
simulations  (Miller  and  Page  2007)  rather  than  conventional  linear  methods 
such  as  Euclidean  geometry  and  Gaussian  statistics.  We  must  harness  the  large 
amounts  of  data  accumulated  on  social  media  and  the  Internet  for  mining  indi- 
vidual and  collective  behaviors.  As  a timely  response  to  the  challenges  aris- 
ing from  big  data,  the  emerging  field  of  computational  social  science  (Lazer 
et  al.  2009,  Watts  2007)  has  been  fundamentally  transforming  the  conven- 
tional social  sciences  into  a data-  and  computational- intensive  science.  Unlike 
computational  sciences  in  the  twenty  century,  computational  social  science  is 
a product  of  the  twenty-first  century,  and  it  should  be  correctly  interpreted 
as  data-intensive  computational  social  science  in  the  big  data  era.  This  is  the 
same  for  computational  geography,  which  appeared  first  in  the  1990s  (Open- 
shaw  1998),  should  be  characterized  as  computational-  and  data-intensive  in 
the  twenty-first  century  (Jiang  2013b).  In  the  big  data  era,  cartography  faces 
the  same  challenge  of  how  to  efficiently  and  effectively  visualize  the  large 
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amounts  of  crowdsourcing  geographic  information.  I believe  that  recognition 
of  the  fractal  nature  of  maps  and  mapping  (Jiang  2015c)  offers  a way  to  meet 
the  challenge. 


Conclusion 

This  paper  has  developed  an  argument  that  head/ tail  breaks  can  be  used  for 
visualization  of  structure  and  dynamics  of  natural  cities  in  the  big  data  era.  To 
support  the  argument,  I have  developed  several  case  studies  applied  to  natu- 
ral cities  derived  from  crowdsourcing  data.  This  paper  has  also  discussed  how 
head/tail  breaks  leads  to  a new  definition  of  fractals,  helping  improve  our  intui- 
tions for  seeing  fractals  in  nature  and  society.  Throughout  the  paper,  we  have 
seen  both  mathematical  fractals  (such  as  the  Sierpinski  carpet,  Mandelbrot  set, 
and  Koch  flakes)  and  geographic  features  (such  as  the  natural  cities  and  streets) 
share  the  same  recurring  scaling  of  far  more  small  things  than  large  ones.  The 
scaling  property  is  what  drives  the  development  of  head/tail  breaks.  It  is  the 
scaling  property  that  makes  head/tail  breaks  an  efficient  and  effective  visuali- 
zation tool  for  revealing  city  structure  and  dynamics.  The  power  of  head/tail 
breaks  lies  in  its  simplicity:  split  things  around  an  average  into  a few  large  and 
many  small,  respectively  in  the  head  and  the  tail  of  the  nested  rank-size  plots, 
and  recursively  continue  the  splitting  process  in  the  head  until  the  condition  of 
far  more  small  things  than  large  ones  is  violated.  The  simple  head/tail  breaks 
and  its  induced  Ht-index  can  help  even  the  general  public  to  see  a variety  of 
fractals  in  science,  art,  and  society. 

The  notion  of  natural  cities,  as  a product  of  the  big  data  era,  provides  a pow- 
erful tool  to  study  human  activities  on  the  earths  surface,  and  enables  us  to 
develop  new  insights  into  geographic  information  harvested  from  crowdsourc- 
ing data.  Compared  with  conventional  real  cities  that  are  imposed  by  authori- 
ties from  the  top  down,  natural  cities  are  defined  from  the  bottom  up,  and  from 
individual  people  and  their  interactions.  Unlike  real  cities,  natural  cities  can 
be  naturally  and  objectively  derived  and  delineated  from  big  data  such  as  VGI 
and  social  media  data.  This  makes  natural  cities  universally  available  for  the 
entire  world,  i.e.  all  the  natural  cities  in  the  world  rather  than  those  in  some 
countries.  In  this  regard,  big  data  is  probably  not  so  much  about  bigness,  but 
rather  completeness.  Natural  cities  may  fundamentally  change  the  ways  cities 
were  studied. 
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Abstract 

Volunteered  geographic  information  (VGI)  plays  an  increasing  role  in  current 
geodata  provision.  At  the  same  time,  due  to  its  lack  of  structure,  it  is  hard  to 
use  as  meaningful  input  in  software  applications.  In  this  chapter,  we  embark 
upon  the  unstructured  character  of  VGI  and  on  ways  to  enrich  the  structure 
in  order  to  make  it  suitable  for  information  retrieval.  We  describe  the  charac- 
teristics of  semantic  enrichment  and  explain  how  folksonomies  and  ontolo- 
gies play  a role.  We  believe  that  they  represent  different  levels  of  formality 
in  a semantic  reference  space  and  determine  the  richness  of  the  information 
retrieval. 
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Introduction 

Recent  developments  in  personal  computing,  GPS  and  Web  2.0  technologies 
are  enabling  a wide  web  audience  to  actively  contribute  to  geo -information 
through  the  internet.  Information  obtained  in  this  way  - commonly  referred  to 
as  volunteered  geographic  information  (VGI)  - is  often  difficult  to  query  due 
to  several  reasons. 

The  complexity  of  querying  is  rooted  in  the  informal,  unstructured,  heteroge- 
neous nature  of  VGI,  which  is  often  published  without  a description  of  its  con- 
text. Those  issues  are  inherent  to  the  process  by  which  VGI  is  produced,  i.e.  by 
individuals  who  are  in  most  cases  not  concerned  with  the  query  process.  This 
chapter  investigates  how  the  process  of  querying  VGI  can  be  improved  by  seman- 
tically enriching  it  during  its  production  and  after  it  is  published.  The  enrichment 
connects  VGI  to  well-known  concepts  which  are  captured  in  both  informal  struc- 
tures (folksonomies)  and  formal  structures  (ontologies).  A folksonomy  repre- 
sents a particular  domain  through  a set  of  user-generated  tags/topics  of  domain- 
related  information,  whereas  an  ontology  constitutes  a domain  more  rigorously 
through  the  representations  of  logical  relationships  between  concepts  used  in  that 
domain.  In  this  research  we  differentiate  the  semantic  enrichment  along  the  line 
of  informal-formal  conceptualization,  i.e.  evaluating  conceptual  bases  ranging 
between  folksonomy  and  ontology,  supporting  the  enrichment  of  VGI. 

The  main  point  we  want  to  stress  is  that  VGI  implies  further  degrees  of  free- 
dom and  expression  for  the  users,  which  can  enable  new,  different  narratives  in 
collecting,  describing,  and  representing  geographic  information.  At  the  same 
time,  this  intrinsic  diversity  requires  the  creation  of  ‘interfaces  between  VGI 
datasets  and  any  algorithm  aiming  to  analyze  them,  in  order  to  translate  the 
folksonomy  (representing  the  vocabulary  used  in  the  VGI)  into  the  structure 
used  by  a query  algorithm.  This  is  a challenge  in  terms  of  1)  the  ad-hoc  work 
necessary  to  deal  with  the  data  and  2)  the  errors,  misinterpretation,  and  infor- 
mation loss  in  the  translation. 

We  pose  the  following  main  research  question  and  set  the  scene  for  its  dis- 
cussion, but  do  not  claim  to  answer  it  yet  fully:  how  does  varying  the  level  at 
which  a top-down  ontology  is  applied  to  a bottom-up  folksonomy  change  the 
understanding  of  underlying  data,  and  thus  the  ability  of  querying  VGI? 

The  goal  is  to  query  VGI  sources  such  as  Tweets,  commented  photos  and 
news  items  about  the  named  features  they  contain.  Typical  queries  are 

• what  is  the  location  of  a feature  named  X? 

• what  is  the  footprint  of  a feature  named  X? 

• what  are  the  features  located  at  or  near  P? 

• what  are  the  features  with  type  T? 

In  some  cases  this  involves  the  harvesting  of  implicit  geographic  information 
(see  also  Kessler  et  al.  2009)  and  in  other  cases  such  information  cannot  be 
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directly  extracted  from  the  VGI  itself  and  it  needs  to  enriched  with  more  for- 
mally structured  information  obtained  from  related  sources,  such  as  Wikipe- 
dia, OpenStreetMap,  Geonames,  etc.  (see  Smart  et  al.  2010). 


Terminology  of  semi- structured  data 

VGI  may  appear  as  structured,  semi- structured  and  unstructured  data.  Kitchin 
(2014)  defines  structured  data  as  data  ‘that  can  be  easily  organized,  stored  and 
transferred  in  a defined  data  model’,  thus  encompassing  all  data  that  can  be 
represented  and  dealt  with  using  relational  databases  and  other  technologies 
or  representational  models  such  as  object-oriented  languages  or  description 
logics.  As  a result,  this  kind  of  data  can  be  straightforwardly  processed  through 
algorithms  and  visualized  using  graphs  and  maps.  By  contrast,  semi- structured 
data  don’t  have  a,  rigid,  regular,  or  complete  predefined  data  model/schema  as 
required  by  traditional  databases  (Abiteboul  1997),  though  having  ‘a  reasonably 
consistent  set  of  fields’  (Kitchin  2014).  These  include  content  that  could  barely 
be  coded  in  a relational  database,  while  still  being  characterized  by  irregular 
and  flexible  structures.  Abiteboul  (1997)  provides  a clear  explanation  of  how 
HTML  pages  are  a good  example  of  semi- structured  data,  due  to  their  lack  of 
uniformity,  and  ample  use  of  plain  text.  Finally,  data  is  defined  as  unstructured 
if  it  has  no  structure  that  can  be  identifiable  as  common  for  the  whole  dataset, 
despite  each  element  of  the  same  dataset  might  have  its  own  internal  structure, 
which  is  not  shared  by  any  other  element. 

On  such  basis,  most  VGI  content  would  be  classified  as  semi-structured  data, 
as  few  datasets  could  be  straightforwardly  dealt  with  in  a relational  database. 
Instead  VGI  datasets  commonly  use  loose  data  definitions  and  categorizations, 
which  are  flexible  and  constantly  edited  by  the  same  users,  as  well  as  more 
suitable  to  describe  large  quantities  of  vague  information.  A good  example  of 
loose  categorization  of  geographic  data  can  be  found  amongst  the  OpenStreet- 
Map (OSM)  map  features  — which  include  over  eight  thousand  different  user- 
defined  kind  of  shops. 

Most  VGI  content  would  also  fit  in  Kitchin’s  (2014)  definition  of ‘captured’ 
data‘,  that  is  data  that  has  been  directly  captured  through  some  device  with  the 
specific  intention  of  capture  the  data.  However,  it  might  be  argued  that  geo- 
tagged  information,  such  as  photos,  entail  ‘exhaust’  geographic  data  (implic- 
itly included  geodata)  in  the  form  of  GPS  coordinates  in  the  image  header — 
that  is,  as  byproduct  of  capturing  the  photo,  but  not  as  main  outcome  of  the 
process. 


Characterizing  the  heterogeneity  of  VGI 


VGI  is  very  heterogeneous  and  diverse,  due  to  three  major  reasons. 
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First,  geographic  features  are  very  heterogeneous,  since  there  are  far  more 
small  geographic  features  than  large  ones.  Using  a more  scientific  terminology, 
geographic  features  are  fractal  or  scaling  (Jiang  & Yin  2014),  and  they  are  best 
characterized  by  some  heavy  tailed  (Zipf  1949)  rather  than  Gaussian-like  dis- 
tributions. There  are,  for  example,  far  more  small  street  blocks  than  large  ones. 
The  small  street  blocks  can  be  named  as  city  blocks  in  cities,  while  the  big  street 
blocks  are  called  field  blocks  in  the  countryside.  The  small  street  blocks  consti- 
tute cities  or  natural  cities  to  be  more  precise,  whereas  the  large  street  blocks 
collectively  form  the  countryside. 

The  heterogeneity  of  OSM  data  can  be  examined  from  various  aspects  such 
as  element  sizes,  the  number  of  edits,  and  the  number  of  users  for  each  element 
(Ma  et  al.  2015).  For  example,  the  element  size  ranges  from  3 up  to  5,000,000, 
the  number  of  edits  for  each  element  can  go  up  from  1 to  2,000.  It  is  the  het- 
erogeneity that  makes  VGI  unique  and  powerful  in  comparison  to  authori- 
tative geographic  information.  It  is  the  heterogeneity  that  makes  VGI  differ 
fundamentally  from  small  data.  It  is  the  heterogeneity  that  makes  researching 
VGI  interesting  and  exciting.  We  should  go  beyond  small  data  thinking  such  as 
Gaussian  distributions  and  Euclidean  geometry,  and  towards  big  data  thinking 
such  as  heavy  tailed  distributions  and  fractal  geometry. 

Second,  VGI  can  be  produced  through  different  methods  and  technologies, 
implying  different  levels  of  structural  rigidity,  ranging  from  menu  entries  to 
free  text  entries.  This  has  important  implications  for  semantic  querying  (see 
Section  6).  The  same  VGI  dataset  may  contain  structured,  semi-structured 
and  unstructured  data.  For  instance,  the  geometric  part  of  OSM  or  the  data / 
time  metadata  of  Twitter  are  structured,  as  there  is  a fixed  schema  for  them; 
additional  information  about  geographic  features  in  OSM  consists  of  semi- 
structured  sets  of  tag-value  pairs;  and  the  content  of  some  fields  are  unstruc- 
tured texts.  In  OSM,  tags  can  be  freely  created  by  the  user  and  the  way  people 
assign  a geometry  to  a feature  is  not  always  consistent  throughout  the  project 
(with  different  scales  typically  (Touya  & Brando  2013)). 

Third,  VGI  contributors  may  come  from  very  diverse  geographic,  cultural, 
and  technical  backgrounds,  and  thus  might  be  accustomed  with  different  ter- 
minologies, or  have  different  narratives.  Some  VGI  is  produced  with  a shared 
conceptualization  that  can  be  a set  of  tags  or  a category  graph  (like  OSM  or 
DBpedia),  yet  the  production  of  data  with  this  conceptualization  in  mind 
is  done  differently  depending  on  contributors  (Brando  & Bucher  2010).  An 
example  is  mapping  of  crimes,  where  people  can  interpret  the  levels  of  violence 
differently.  Besides,  sets  of  tags  evolve  over  time.  Hence,  if  data  have  not  been 
tagged  with  a specific  tag,  it  might  just  be  due  to  the  fact  that  that  tag  did  not 
exist  when  they  were  produced. 

The  heterogeneity  of  geographic  features,  modes  of  production,  and  contrib- 
utors background  are  all  contributing  to  the  fact  that  the  quality  of  VGI  is  often 
disputed  and  that  even  the  quality  itself  is  heterogeneous. 
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Folksonomies  and  Ontologies  for  querying  VGI 

Writing  an  algorithm  to  perform  a task  on  a given  data  source,  or  querying  this 
source,  can  be  better  accomplished  if  the  meaning  of  each  element  of  the  source 
is  well  defined.  In  traditional  structured  sources  this  meaning  is  conveyed  by  a 
database  schema  or  a datatype  definition  expressed  in  a database  or  program- 
ming language.  In  VGI  the  situation  is  different  because  1)  the  schema,  if  it 
exists,  may  not  be  sufficient,  due  to  various  interpretations  by  the  users,  and  2) 
many  VGI  sources  are  only  semi-structured,  without  any  centrally  defined 
schema.  Therefore  it  is  necessary  to  rely  on  some  semantic  resource  to  repre- 
sent the  meanings  of  the  data  elements. 

There  are  several  types  of  such  semantic  resources,  ranging  from  the  most 
informal  (folksonomies  or  glossaries)  to  the  most  formal  (formal  logical  ontol- 
ogies). These  resources,  generally  known  as  knowledge  organization  systems, 
can  be  characterized  along  two  axes:  1)  the  structure  complexity  of  the  underly- 
ing data  (tags,  classes,  hierarchical  relations,  etc.)  and  2)  the  formalism  used  to 
express  concept  definitions.  Figure  1 presents  a classification,  along  these  axes, 
of  the  most  frequently  used  knowledge  organization  systems. 


Semantic  enrichment  of  VGI 

Semantic  enrichment  refers  to  the  process  of  making  information  more  mean- 
ingful by  adding  explicit  structure,  metadata,  definitions,  etc.  Explicit  means 
that  the  result  is  queryable. 
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Figure  1:  Informal  and  formal  semantic  reference  space. 
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Figure  2:  Semantic  enrichment  process. 


The  result  of  the  enriching  process  obviously  relies  on  the  syntax  and  seman- 
tics of  the  target  information  and  the  enriching  information  source  and  the 
way  in  which  the  enrichment  is  performed  (see  Figure  2).  We  highlight  three 
aspects  which  are  crucial  for  a successful  usage  of  the  enriching  process:  1)  The 
semantics  of  the  enriching  source,  2)  The  semantics  of  the  enriching  informa- 
tion and  3)  The  syntax  of  the  enriching  information. 

Targets  can  be  free  text  in  which  case  grammar  rules  provide  the  enrich- 
ing information  source  and  semantic  enrichment  is  done  through  natural  lan- 
guage processing  (NLP)  (see  for  example  Penas  & Hovy  2010).  In  some  cases 
the  enriching  information  is  intrinsically  held  in  the  target  itself,  such  as  rela- 
tionships between  items  in  a photo  databases,  such  as  Flickr.  In  such  cases  co- 
occurrence and  data  mining  methods  can  be  used  to  make  these  knowledge 
explicit  (Deng  et  al.  2009). 

In  other  cases  the  enriching  information  source  is  constituted  by  other 
sources,  such  as 

1)  (for  exhaust-like  data):  web  resources,  sensors,  gazetteer  information  (see 
Graham  & De  Sabbata  2015),  etc. 

2)  (for  captured  data):  shared  guidelines’,  each  capturer’s  skills  and  inten- 
tion, tasks  assignment  between  several  capturers,  and  the  capturers’  abili- 
ties to  work  together. 

The  second  category  can  be  captured  by  context  models  and  provenance  (back- 
ground on  how  the  information  was  produced)  as  reported  in  (Abel  et  al.  2012). 
The  enriching  information  appears  itself  in  different  forms,  for  example  as  an 
ontology  (see  Lacasta  et  al.  2012). 

In  geospatial  applications  the  semantic  aspects  of  space  put  an  extra  con- 
straint on  semantic  enrichment.  Ballatore  et  al.  (2011)  combine  a semanti- 
cally-rich  and  spatially-poor  ontology  (DBpedia)  with  a spatially- rich  and 
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semantically-poor  VGI  dataset  (OpenStreetMap)  to  facilitate  spatial  knowledge 
discovery.  As  geo -information  is  so  often  constructed  through  multi-function 
workflows,  provenance  plays  an  important  role  in  understanding  the  created 
geo -information.  In  addition,  in  VGI  projects  we  think  it  is  relevant  to  capture 
what  people  intended  to  do  with  the  VGI  at  hand. 

The  result  of  the  semantic  enrichment  can  range  from  an  ontology  to  more 
light-weight  schema  elements.  Such  enriching  information  can  exist  as  separate 
entities  relating  to  the  target  information  or  can  be  embedded  within  the  target 
as  metadata.  As  such  they  provide  a more  meaningful  view  on  the  target  data 
and  the  basis  for  more  meaningful  queries,  as  described  in  the  next  section. 


Towards  semantic  queries 

In  this  section  we  show  the  different  uses  of  semantic  enrichment  when  que- 
rying a VGI  source.  For  each  structure  level  (structured,  semi- structured, 
unstructured)  we  study  what  can  be  done  with  and  without  semantic  enrich- 
ment: what  are  the  problems  and  limitations  with  ‘direct’  queries  on  the  VGI 
source  and  how  different  types  of  enrichment  can  help. 

The  main  problem  that  arises  when  querying  the  structured  part  of  VGI  lies 
certainly  in  the  differences  in  terms  of  quality  and  semantics  that  occur  in  the 
attribute  values.  E.g.  a time  value  may  be  expressed  in  local  time  or  in  UTC, 
a length  with  different  units  of  measurement,  etc.  These  variations  may  ulti- 
mately render  query  results  very  imprecise  or  even  meaningless. 

The  semantic  enrichment  of  structured  data  may  connect  structural  elements 
(table  and  attribute  names)  or  data  elements  (attribute  values)  to  semantic  enti- 
ties in  some  knowledge  organization  system. 

The  semantic  enrichment  of  structural  elements  can  be  exploited  by  meta- 
level queries  that  help  build  correct  queries.  For  instance:  Find  the  tables  and 
attributes  that  hold  information  about  employment  rates.  This  is  particularly 
useful  when  the  database  schema  is  large  and  complex.  Any  type  of  knowledge 
organization  system  can  be  used  for  this  purpose.  In  a geographical  context, 
enrichment  can  be  done  for  example  by  making  geographic  properties  explicit, 
e.g.  that  a bridge  is  part  of  a road,  ‘built-up  area  is  an  aggregate  of  building 
features,  etc. 

If  the  enrichment  is  done  with  a formal  logical  ontology,  it  becomes  possible 
to  express  deductive  queries  (such  as  in  the  programming  language  Datalog) 
that  can  produce  results  not  computable  with  standard  SQL  queries. 

The  semantic  enrichment  at  the  data  level  consists  in  associating  attrib- 
ute values  to  descriptors  that  make  them  meaningful  (units,  scale,  accuracy, 
etc).  These  descriptions  can  then  be  used  to  augment  the  queries  with  selec- 
tion criteria  or  transformation  functions  to  produce  higher  quality  results 
(e.g.  select  only  those  data  that  have  a sufficient  accuracy  and  a given  unit  of 
measurement). 
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Since  VGI  sources  essentially  link  data  to  geographic  entities,  a natural  enrich- 
ment consists  in  annotating  the  data  elements  to  entities  in  some  geographic 
knowledge  source,  such  as  Geonames  or  a geographic  ontology.  This  will  allow 
for  semantic  queries  that  combine  geographic  knowledge  and  other  data. 

Semi-structured  VGI  presents  additional  types  of  problems.  Since  the  schema 
is  generally  not  controlled,  users  can  create  multiple  structures  to  represent  the 
same  real  world  phenomenon.  For  instance,  the  DBpedia  database  has  at  least 
five  different  properties  to  describe  a persons  birthplace,  even  though  these 
names  are  obtained  from  supposedly  structured  ‘infoboxes  of  Wikipedia.  This 
leads  to  complex  queries  in  which  all  the  possible  property  and  value  names 
must  appear,  e.g.  {?x  ex:placeOfBirth  ?p}  union  {?x  ex:birthplace  ?p}  union  {?x 
birthplace  ?p}  in  a SPARQL  query.  The  problem  is  of  course  worse  with  sources 
in  which  users  regularly  define  new  attribute  names  and  values,  as  is  the  case 
in  OpenStreetMap.  In  some  situations  it  may  become  almost  impossible  to 
express  consistent  and  complete  queries. 

For  querying  semi-structured  VGI,  the  role  of  the  semantic  enrichment  is 
essentially  to  describe  and  unify  (or  differentiate)  the  multiple  naming  schemes 
produced  by  the  users.  If  the  names  used  in  the  VGI  source  are  associated  to 
corresponding  entities  in  an  ontology,  then  the  ontology’s  vocabulary  can  be 
used  to  express  unified’  queries  that  can  be  automatically  rewritten  into  the 
VGI’s  vocabulary  to  produce  a (complex)  query  on  the  VGI  source. 

In  the  case  of  OpenStreetMap,  many  tags  may  designate  the  same  concept 
and  a single  tag  may  designate  different  concepts  in  different  contexts.  For 
instance,  the  semantic  query  roadType  motorway  will  return  features  (roads) 
tagged  with  roadCategory  motorway,  roadCategory  highway,  roadCategory 
turnpike,  category,  turnpike,  type  motorway,  etc.  And  in  case  the  formal  lan- 
guage models  more  relationship  types,  it  will  also  indicate  related  features  such 
as  bridges,  traffic  lights,  etc. 

The  abstraction  level  of  the  ontology  used  for  the  enrichment  will  determine 
the  level  of  (semantic)  detail  of  the  queries.  A high  level  ontology  will  unify  many 
different  names  of  the  VGI  into  a single  high  level  concept,  while  more  precise 
(domain  specific)  ontologies  will  enable  queries  that  are  closer  to  the  VGI’s  level 
of  granularity.  A similar  remark  applies  to  the  geographic  axis.  The  geographic 
precision  of  the  results  will  depend  on  the  scale  and  precision  the  geographic 
ontology  that  is  used  to  enrich  the  VGI  source.  Moreover,  if  the  enrichment 
structure  possesses  a rich  semantic  structure,  with  subclass  relations  or  more 
sophisticated  axioms,  it  will  support  a more  expressive  query  rewriting.  For 
instance,  a query  about  Artists  could  be  rewritten  as  a query  about  its  subclasses 
Painters,  Sculptors,  Musicians,  etc.  if  the  Artist  concept  is  not  directly  repre- 
sented in  the  VGI  source.  In  more  than  one  case  one  should  combine  several 
ontologies  to  support  the  queries  (see  for  example  Lemmens  & Kessler  2014). 

In  semantic  enrichment  a basic  effort  consists  of  associating  texts  with  the 
concepts  they  deal  with,  this  is  generally  accomplished  with  techniques  such 
as  word  sense  disambiguation  and  named  entity  recognition  (in  particular 
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geographic  entity  recognition).  In  this  effort,  one  has  to  implement  methods 
that  are  able  to  deal  with  the  vagueness  of  information.  With  this  kind  of  enrich- 
ment, semantic  queries  can  answer  questions  such  as  ‘find  the  data  elements 
(texts)  about  the  concept  X’.  Higher  levels  of  enrichment  consist  in  extracting 
precise  information  (facts)  from  texts.  This  amounts  to  transform  unstructured 
sources  into  (semi-)structured  ones,  which  is  still  an  open  research  challenge. 


Conclusions  and  recommendations 

The  semi- structured  nature  of  VGI  causes  without  doubt  problems  in  the  que- 
rying of  its  contents.  We  have  presented  several  ways  of  imposing  structure  in 
order  to  facilitate  more  meaningful  queries.  Even  the  most  basic  queries,  which 
go  beyond  text  search,  rely  on  some  kind  of  structure.  Whether  the  right  degree 
of  structure  can  be  created  depends  on  the  success  of  the  semantic  enrichment 
process.  In  case  of  VGI,  there  are  a variety  of  options,  for  which  some  of  them 
rely  on  the  reference  to  geodatabases.  Semantic  enrichment  is  basically  consti- 
tuted by  linking  the  VGI  to  ontological  concepts  and  their  relationships. 

We  believe  that  some  of  the  enriching  information  sources  need  curation 
as  they  are  often  ambiguous  themselves.  The  level  of  enrichment  needed,  for 
which  the  semantic  reference  space  is  positioned  between  folksonomy  and  for- 
mal ontology,  depends  on  the  type  of  queries  and  needs  to  be  further  investi- 
gated with  practical  use  cases. 
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Abstract 

With  millions  of  users  worldwide,  crowd-sourced  social  media  data  provide 
a valuable  data  source  for  events  happening  around  the  world.  More  specifi- 
cally, microblogs,  which  are  social  networks  that  enforce  short  text  messages, 
have  a high  popularity  due  to  their  availability  as  a mobile  application  and  the 
practicality  of  short  messages.  Estimating  the  location  of  the  events  detected 
by  following  posts  in  microblogs  have  been  the  motivation  of  numerous  recent 
studies.  Extracting  the  location  information  and  estimating  the  event  location 
is  a challenging  task  to  maintain  satisfactory  situation  awareness,  especially  for 
emergency  cases  such  as  fire  or  traffic  accidents.  Today,  Twitter  is  among  the 
most  popular  microblogging  platforms,  and  there  are  recent  research  efforts 
aimed  at  detection  of  novel  events  online  by  following  the  Tweets.  In  order 
to  analyze  events,  researchers  generally  focus  on  spatio-temporal  features  of 
the  posts.  Temporal  features  denote  the  time  and  ordering  of  posts,  whereas 
spatial  features  are  useful  for  location  extraction  or  estimation.  In  this  work, 
we  present  an  overview  on  the  process  for  toponym  recognition  and  location 
estimation  from  microblogs. 
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Introduction 

Social  media  platforms  constitute  a rich  and  up-to-date  information  resource 
with  their  rapidly  increasing  number  of  users.  With  its  popular  features,  includ- 
ing the  text  message  limit  and  easy  access  on  mobile  environments,  microblog- 
ging platforms  in  particular  are  intensively  used  by  a large  number  of  users. 
Among  such  platforms,  Twitter  appears  as  the  most  popular  service  with  a 
high  number  of  users.  Since  most  of  the  users  allow  public  access  to  posts  and 
user  profile  information,  it  provides  rich  data  for  research  areas  including  text 
analysis,  text  mining,  information  extraction  or  social  analysis. 

Event  extraction  and  location  estimation  from  crowd-sourced  data  in  social 
networks  are  important  for  learning  about  the  events  happening.  These  tech- 
niques have  applications  in  various  domains  involving  being  aware  of  what 
is  happening  in  the  vicinity  and  rapid  access  to  emergency  information.  As  a 
potential  use  case,  for  instance,  on  the  basis  of  the  Twitter  posts  of  users  wit- 
nessing a major  accident  in  which  a chain  of  cars  are  involved,  it  may  be  possi- 
ble to  estimate  the  event  location,  open  alternative  routes  and  provide  first  aid. 
Similarly,  for  management  of  disasters  such  as  earthquakes  or  floods,  accessing 
such  information  fast  is  valuable. 

An  important  task  to  be  fulfilled  for  location  estimation  is  extracting  evidence 
about  location,  especially  recognizing  location  names,  i.e.  toponyms,  within  a 
social  network  message.  This  problem  is  a specific  sub-problem  under  named 
entity  recognition  (NER)  task,  which  is  a well-known  natural  language  process- 
ing (NLP)  problem  aimed  at  extracting  certain  types  of  entities  including  peo- 
ple, organizations,  dates,  locations  etc.  from  text.  NER  techniques  proposed  in 
the  literature  mostly  work  on  long  and  formal  texts,  such  as  newspaper  articles. 
In  long  and  formal  texts,  literature  provides  established  NER  solutions.  On  the 
other  hand,  short  and  informal  texts  obtained  in  microblogs  pose  important 
challenges.  Short  text  limits  the  contextual  information  that  can  be  obtained 
from  the  whole  text,  whereas  informal  language  makes  it  difficult  to  recognize 
the  words.  In  this  paper,  we  particularly  concentrate  on  toponym  recognition 
in  microblog  messages  under  these  challenges. 

Once  toponyms  are  recognized  in  microblog  posts,  they  provide  useful  and 
rich  input  for  location  estimation  tasks.  However,  this  step  includes  several 
challenges  as  well.  One  important  challenge  is  that,  in  addition  to  recognized 
toponyms,  there  may  be  other  clues  to  the  location  of  the  event,  such  as  the  GPS 
annotation  of  the  message  provided  by  the  mobile  device.  It  is  necessary  to  make 
use  of  these  clues  in  a complementary  manner.  Another  basic  challenge  is  the 
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existence  of  several  contradictory  toponyms  or  clues,  such  as  having  location 
names  that  point  to  different  coordinates.  It  is  necessary  to  devise  a mechanism 
in  order  to  weigh  how  well  conflicting  clues  contribute  to  the  location  estimation. 

It  is  important  to  note  that,  although  the  proposed  techniques  in  the  literature 
are  generally  demonstrated  on  Twitter  posts,  i.e.  Tweets,  they  can  be  applied  on 
other  social  media  data,  especially  those  having  informal  use  of  language. 

In  this  work,  we  describe  these  steps  in  an  overview.  The  organization  of  this 
paper  is  as  follows.  In  the  next  section,  toponym  recognition  in  informal  text 
is  described.  In  Section  3,  we  describe  the  location  estimation  problem  and  the 
suggested  solutions  in  the  literature,  and  conclude  the  paper  with  a summary 
and  overview  in  Section  4. 


Toponym  Recognition 

As  toponym , referring  to  location  names,  is  a type  of  named  entity,  for  toponym 
recognition,  generally,  techniques  for  Named  Entity  Recognition  (NER)  are 
used.  However,  the  proposed  solutions  conventionally  work  on  formal  texts. 
Recently,  with  the  increasing  amount  of  potentially  rich  data  from  social  net- 
works, NER  and  toponym  recognition  in  web  resources  has  attracted  atten- 
tion. However,  solutions  on  formal  texts  rely  on  very  basic  features  such  as 
capitalization  of  the  first  letter  of  a token,  existence  of  an  apostrophe  character 
within  the  token  or  existence  of  the  token  in  the  gazetteer.  Informal  and  non- 
standardized  language  use  in  social  networks  poses  a challenge  in  applying  the 
same  techniques  in  this  area.  Therefore  new  NER  and  toponym  recognition 
techniques  are  being  proposed. 

In  the  literature  of  toponym  recognition,  the  proposed  techniques  can  be 
categorized  as: 

• gazetteer-based, 

• rule-based,  and 

• machine  learning  based  approaches. 

The  first  step  in  all  these  techniques  is  tokenizing  the  messages  into  words. 
In  addition,  morphological  analysis  and  Part-of- Speech  (POS)  tagging  may 
be  employed  as  preprocessing  steps  if  morphemes/suffixes  and  POS  tags  are 
utilized  in  the  toponym  recognition  process.  Morphological  analysis  enables 
decomposition  of  a word  into  its  affixes  and  the  stem.  POS  tag  of  a word  and 
suffixes  in  a word  are  effective  features  used  in  NER  systems  (StanfordNLP), 
(Seker  2012).  For  example,  for  the  word  ‘happily’,  a morphological  analyzer  for 
English  will  show  that  the  stem  of  the  word  is  ‘happy’  and  having  the  suffix  ‘-ly\ 
An  English  POS  tagger  will  annotate  this  word  as  an  adverb.  For  stemming  and 
POS  tagging,  language  specific  morphological  analysis  tools  are  needed.  For 
English,  one  of  the  well-known  tools  is  the  NLP  library  of  Stanford  NLP  Group 
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(StanfordNLP).  For  example,  for  Turkish  texts,  the  morphological  analysis  tool 
Zemberek  (Zemberek  2015)  is  commonly  used. 

Another  important  preprocessing  step  is  normalization.  Due  to  the  informal 
use  of  language,  text  may  include  spelling  errors  or  unusual  abbreviations.  In 
normalization,  such  problems  are  fixed  before  applying  the  toponym  recogni- 
tion technique.  In  Turkish,  one  of  the  available  tools  for  normalization  is  being 
developed  by  ITU  NLP  group  (Eryigit  2014).  This  tool  fixes  some  of  the  spell- 
ing errors  and  performs  capitalization  for  some  proper  nouns.  Current  solu- 
tions for  normalization  mostly  include  rule -based  corrections  capturing  previ- 
ously known  informal  language  patterns,  such  as  repetition  of  characters  for 
emphasizing  emotion.  For  example,  'Soooooo  cooooooool !!!!’  is  such  a message 
that  can  be  frequently  used  in  microblogs.  Normalization  process  should  be 
able  to  convert  the  words  to  'So’  and  cool’. 


Gazetteer-based  Approach 

In  this  approach,  a predefined  list  of  location  names  is  used  as  the  gazetteer. 
The  recognition  process  basically  relies  on  checking  whether  a given  token  is  in 
the  gazetteer.  The  content  and  the  granularity  of  the  list  depend  on  the  context 
of  the  toponym  recognition  application.  For  general-purpose  solutions,  the  list 
may  contain  country,  city,  town  or  Point  of  Interest  (POI)  names.  For  a specific 
geographical  region,  this  list  may  be  more  detailed,  including  hospitals,  banks, 
pharmacies  etc.  depending  on  the  context  of  the  application.  For  general-pur- 
pose gazetteers,  OpenStreetMap  (OpenStreetMap  2015)  and  Wikipedia  (Wiki- 
pedia 2015)  are  commonly  used  resources. 

In  this  approach,  toponym  recognition  is  based  on  checking  whether  text 
includes  any  toponym  from  the  gazetteer.  To  this  end,  very  simply,  each  token 
in  the  text  is  looked  up  in  the  gazetteer.  The  ones  that  are  found  are  marked  as 
toponyms.  For  example,  for  the  post  'Enjoying  good  weather  in  Bebek  Park,  in 
Istanbul ’,  with  a general-purpose  gazetteer  that  includes  POIs  as  well  as  city 
names,  it  is  possible  to  recognize  the  toponyms  Bebek  Park  and  Istanbul.  On  the 
other  hand,  with  a more  limited  gazetteer  of  city  names,  only  Istanbul  will  be 
recognized.  In  this  approach,  normalization  is  more  crucial,  since  it  is  based  on 
matching  between  the  token  at  hand  and  the  toponym  in  the  gazetteer. 


Rule-based  Approach 

In  this  approach,  certain  patterns  are  defined  in  the  form  of  the  rules  in  order  to 
recognize  toponyms.  For  instance,  if  the  word  street  follows  a token,  the  token 
is  considered  to  be  a toponym  referring  to  a street  name  (such  as  ‘ Oak  Street’). 
Some  other  patterns  rely  on  the  morphology  or  the  POS  tag  of  the  word.  One 
basic  pattern  for  English  is  that  if  a token  is  preceeded  by  a pronoun,  it  is  likely 
to  be  a toponym  (such  as  'in  Istanbul’).  However,  such  rules  overestimate  the 
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toponyms,  leading  to  false  positive  recognition  for  the  phrases  such  as  con  the 
table.  Therefore,  they  may  have  high  recall  performance,  however  precision  per- 
formance, generally,  is  not  high.  A recent  study  conducted  on  Turkish  Tweets  has 
shown  that  POS  tag/morphology  based  rules  achieved  30%  precision,  whereas 
with  the  gazetteer-based  approach  67%  precision  is  obtained  (Onal  2014). 


Machine  Learning  based  Approach 

In  the  machine  learning  based  approach,  supervised  learning  is  the  most  com- 
mon sub-approach  employed  for  toponym  recognition.  Given  a set  of  texts  in 
which  toponyms  are  annotated,  a toponym  model  is  constructed.  More  specifi- 
cally, Conditional  Random  Field  (CRF)  is  the  most  commonly  used  supervised 
learning  technique  providing  satisfactory  recognition  ratios  for  formal  texts.  The 
advantage  of  CRF  is  that  it  is  possible  to  capture  contextual  information  in  the 
model,  such  as  having  words  in  the  neighborhood  of  a toponym.  For  Turkish 
texts,  in  Seker  (2012),  a CRF-based  NER  solution  is  presented.  However,  the  suc- 
cess of  this  approach  on  informal  texts  heavily  relies  on  the  normalization  step. 

In  a more  recent  study  (Sagcan  2014),  a CRF-based  solution  that  focuses  on 
toponym  recognition  in  microblogging  messages  is  proposed.  The  main  moti- 
vation behind  the  work  is  performing  toponym  recognition  without  using  any 
gazetteer  and  with  less  preprocessing  effort  for  normalization.  The  main  archi- 
tecture of  this  approach  is  given  in  Figure  1. 

As  seen  in  the  figure,  the  architecture  is  composed  of  two  parts.  In  the  first 
part,  data  preparation  is  performed.  The  second  part  contains  CRF-based 
learning  modules. 

In  the  data  preparation  phase,  initially,  conventional  tokenization  and  mor- 
phological analysis  operations  are  applied.  Afterwards,  two  simple  normaliza- 
tion steps  are  applied.  In  the  first  normalization  step,  repeated  characters  are 
eliminated.  For  instance,  900000k  giizeeF  (90k  giizel  (Turkish)  = very  good 
(English)),  gooooool’  (gol  (Turkish)  = goal  (English))  are  commonly  seen  mis- 
spellings in  Turkish  Tweets.  Such  character  repetitions  occur  in  all  languages  in 
Twitter  to  denote  an  emphasis  or  exaggeration  of  emotion.  The  second  normali- 
zation step  is  for  the  cases  in  which  the  English  alphabet  is  used  instead  of  the 
Turkish  alphabet  characters.  Most  of  the  misspellings  in  Turkish  Tweets  origi- 
nate from  replacement  of  Turkish  characters  with  diacritics  (9,  g,  1,  0,  §,  u)  with 
non-accentuated  characters  (c,  g , i,  0,  s,  u).  Such  usage  is  also  very  common  in 
other  non-English  microblogging  posts.  This  step  can  be  customized  according 
to  the  language  alphabet  under  interest.  In  the  literature,  the  proposed  normali- 
zation step  involves  intensive  cleaning  and  pre-processing  on  the  text.  In  Sagcan 
(2014),  only  two  simple  normalization  steps  are  applied  and  the  other  informal 
language  problems  are  expected  to  be  resolved  during  the  learning  phase. 

In  the  second  part  of  the  architecture,  for  CRF-based  learning,  an  annotated 
training  data  set  is  prepared  in  order  to  be  used  for  model  construction.  In  this 
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Figure  1:  The  architecture  for  CRF-based  Toponym  Recognition  (Sagcan  2014). 


step,  the  important  issue  is  which  attributes  to  use  in  the  term  vector  repre- 
senting the  text  message.  Within  the  study,  toponym  recognition  performance 
under  several  attribute  compositions  is  analyzed.  Under  the  best  attribute  com- 
position, the  precision  performance  is  reported  as  88%. 

Location  Estimation 

Studies  on  location  estimation  are  commonly  related  with  event  detection 
efforts.  Recently,  numerous  studies  have  been  conducted  to  detect  real-world 
events  by  collecting  public  Tweets  in  Twitter.  These  include  methods  to  identify 
various  types  of  events  including  earthquakes  (Sakaki  2010,  2013),  disasters 
and  crises  (Yin  2012),  and  accidents  (Rui  2012).  The  location  estimation  tech- 
niques proposed  in  these  studies  use  different  data  and  clues  extracted  from 
Tweets.  In  this  section,  we  revise  the  most  basic  location  estimation  techniques 
according  to  the  data  used. 
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Geo-Annotation  based  Approach 

In  most  of  these  event  detection  studies,  GPS  metadata  of  geo-tagged  Tweets 
are  the  main  resources  for  estimating  the  locations  of  events.  A very  simple 
procedure  is  followed.  Once  the  location  is  estimated,  location  information 
about  the  detected  events  is  presented  on  the  map  by  plotting  the  geo -tagged 
Tweets  about  the  event. 


User  Profile  based  Approach 

In  several  other  studies,  the  textual  content  in  the  location  attribute  of  user 
profiles  has  been  utilized  in  the  place  of  GPS  data  for  non-geo -tagged  Tweets 
in,  for  example,  Sakaki  (2010,  2013)  and  Yin  (2012).  However,  an  in-depth 
study  of  user  profiles  show  that  since  this  attribute  is  a free -text  field  limited 
to  30  characters,  it  may  contain  multiple  location  names  or  even  non-existing 
locations.  According  to  the  observation  of  Rui  (2012),  only  12%  of  users 
specified  a location  in  their  profile.  Hence  the  authors  tried  to  predict  the  user 
locations  by  analyzing  their  previous  Tweets  and  locations  of  their  friends. 

Toponym  based  Approach 

Due  to  the  scarcity  of  geo-tagged  Tweets  and  location  information  in  user  pro- 
files, in  Middleton  et  al.  (2014),  researchers  focused  on  analyzing  the  Tweet  con- 
tent for  location  references  by  implementing  a geo-parser.  Location  estimation 
studies  concentrating  on  Tweet  content  make  use  of  the  toponym  recognition 
techniques  as  discussed  in  Section  2.  For  example,  in  Sankaranarayanan  et  al. 
(2009),  the  authors  used  a gazetteer  for  toponym  recognition  and  described  a 
heuristic  for  toponym  resolution  in  Tweets.  Within  the  scope  of  event  detection, 
some  of  the  studies  concentrate  on  estimating  the  location  by  using  the  whole 
cluster  of  Tweets  that  correspond  to  an  event.  The  most  frequently  mentioned 
toponym  in  Tweet  contents  and  user  profiles  is  designated  as  the  event  location. 

Evidence  Combination  based  Approach 

In  Ozdikis  (2013)  a Bayesian  approach  is  employed  that  uses  evidences  from 
several  resources  for  location  estimation.  To  this  aim,  three  resources  are  used: 

• Toponyms  within  the  Tweet  text 

• GPS  annotations  on  Tweets 

• Toponyms  in  user  profiles 

As  a basic  difference  from  previous  studies,  for  location  estimation,  Dempster- 
Shafer  Theory  (DST)  (Dempster  1967),  which  can  combine  evidence  from 
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various  resources  in  a single  model,  is  used.  DST  is  a generalization  of  Bayesian 
inference  technique.  DST  is  a technique  under  cases  in  which  evidences  are 
limited  or  they  provide  conflicting  data.  In  addition,  estimation  result  can  be 
given  in  a belief  interval  such  as  [X%,  Y%]  instead  of  a single  probability  value. 

In  social  networks,  the  number  of  postings  sent  from  populated  regions  may 
distort  the  result,  since  the  average  number  of  messages  posted  from  such  loca- 
tions are  already  higher  than  the  other  places  and  this  leads  to  a bias  towards 
such  places.  Hence,  in  Ozdikis  et  al.  (2013)  a normalization  operation  is  devised 
and  applied  on  evidence  probabilities  to  prevent  this  bias. 

As  the  evidence  from  the  toponyms  within  the  Tweet  texts,  the  authors  ini- 
tially employed  a gazetteer-based  approach,  however,  other  toponym  recogni- 
tion techniques  such  as  machine  learning-based  approaches  can  be  incorpo- 
rated into  the  system. 

In  Ozdikis  (2013),  experiments  are  conducted  on  a set  of  Tweets  posted 
about  minor  two  earthquakes  in  Turkey,  which  occurred  on  May  17,  2013  near 
the  city  of  Mugla,  and  on  July  30,  2013  on  an  island  near  the  city  of  Canakkale. 
They  were  not  strong  quakes,  and  did  not  cause  any  damage,  but  they  were 
felt  by  people  in  these  regions  and  triggered  a reaction  in  Twitter  traffic.  For 
the  first  event,  the  belief  interval  found  for  Mugla  is  [0.77,  0.97],  which  can 
be  considered  as  a confident  decision.  For  the  second  event,  the  highest  belief 
interval  is  found  for  Canakkale,  as  [0.24,  0.42].  The  next  highest  belief  intervals 
are  [0.16,  0.34]  for  Edirne  and  [0.14,  0.31]  for  Tekirdag.  Although  Istanbul  is 
a metropole  from  which  many  messages  are  posted,  the  belief  interval  as  the 
location  of  the  event  is  calculated  only  as  [0.03,  0.21].  Hence,  the  results  show 
the  applicability  of  the  method  for  location  estimation. 


Conclusion 

Social  networks,  especially  microblogs,  contain  valuable  data  including  geo- 
graphical footprints  of  users.  A message  becomes  geo-annotated  in  terms  of 
latitude  and  longitude  if  the  user  posts  it  using  a GPS-enabled  device  and 
allows  the  sharing  of  this  geographic  information.  In  addition,  users  tell  their 
location  in  their  social  network  profile.  Moreover,  users  may  talk  about  places 
in  their  messages.  Such  attributes  of  Tweets  are  valuable  resources  for  spatial 
analysis.  There  are  various  studies  on  location  estimation  that  use  these  attrib- 
utes. Most  of  them  rely  on  a single  attribute.  However,  techniques  that  combine 
several  attributes  in  a single  model  for  location  estimation  have  more  potential 
for  accurate  estimation. 

Each  of  these  attributes  cause  different  challenges.  In  geo-annotated  mes- 
sages, GPS  coordinates  provide  precise  geographic  position  in  terms  of  latitude 
and  longitude.  However,  the  number  of  geo-tagged  messages  is  very  limited.  In 
addition,  GPS  coordinate  and  the  location  of  the  event  mentioned  in  the  Tweet 
may  not  be  the  same. 
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Similar  issues  appear  for  location  attribute  of  user  profiles.  The  number  of 
profiles  including  valid  location  information  is  very  limited.  In  addition,  hav- 
ing valid  location  information  in  the  profile  does  not  guarantee  that  the  user  is 
actually  at  that  location  at  the  time  of  message  posting,  or  the  user  is  mention- 
ing that  location  in  the  Tweet. 

Challenges  with  the  Tweet  content  and  location  attributes  of  user  profiles 
mostly  concern  text  processing  and  toponym  recognition.  The  quality  of  con- 
tent is  not  as  good  as  that  in  news  articles  due  to  misspellings,  extraordinary 
writing  conventions  and  abbreviations.  Therefore,  state-of-the-art  NLP  parsers 
do  not  perform  as  accurately  on  informal  social  media  message  texts.  The  cur- 
rent efforts  mostly  rely  on  applying  a normalization  process  before  toponym 
recognition.  The  research  trend  is  towards  minimizing  the  normalization  effort 
and  proving  a more  general  adaptive  approach. 

Toponym  recognition  and  location  estimation  on  social  media  research  in 
the  literature  are  mostly  applied  on  Twitter  posts.  This  is  due  to  the  fact  that 
Twitter  has  a high  number  of  users  and  most  of  the  data  in  Twitter  is  pub- 
licly available  through  its  application-programming  interface.  However  the 
described  techniques  are  also  applicable  to  other  crowd- sourced  social  media. 

As  a future  work,  several  other  research  dimensions  can  be  pursed  for  topo- 
nym recognition  and  location  estimation.  For  toponym  recognition,  various 
other  features  can  be  analyzed.  Hashtags  appear  to  be  valuable  resources.  In 
addition,  links  or  photographs  attached  in  the  messages  can  be  further  investi- 
gated. Meta  features  or  tags  of  the  pictures  and  videos,  too,  may  provide  topo- 
nyms.  For  location  estimation,  in  addition  to  combination  of  features  from  a 
single  data  source,  evidence  from  several  complementary  data  resources,  such 
as  Twitter  and  Foursquare,  may  be  combined  for  increasing  the  precision  of  the 
location  prediction. 
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Abstract 

The  public  have  used  Twitter  world  wide  for  expressing  opinions.  This  study 
focuses  on  spatio-temporal  variation  of  georeferenced  Tweets  sentiment  polar- 
ity, with  a view  to  understanding  how  opinions  evolve  on  Twitter  over  space 
and  time  and  across  communities  of  users.  More  specifically,  the  question  this 
study  tested  is  whether  sentiment  polarity  on  Twitter  exhibits  specific  time- 
location  patterns.  The  aim  of  the  study  is  to  investigate  the  spatial  and  temporal 
distribution  of  georeferenced  Twitter  sentiment  polarity  within  the  area  of  1 
km  buffer  around  the  Curtin  Bentley  campus  boundary  in  Perth,  Western  Aus- 
tralia. Tweets  posted  in  campus  were  assigned  into  six  spatial  zones  and  four 
time  zones.  A sentiment  analysis  was  then  conducted  for  each  zone  using  the 
sentiment  analyser  tool  in  the  Starlight  Visual  Information  System  software. 
The  Feature  Manipulation  Engine  was  employed  to  convert  non-spatial  files 
into  spatial  and  temporal  feature  class.  The  spatial  and  temporal  distribution 
of  Twitter  sentiment  polarity  patterns  over  space  and  time  was  mapped  using 
Geographic  Information  Systems  (GIS).  Some  interesting  results  were  identi- 
fied. For  example,  the  highest  percentage  of  positive  Tweets  occurred  in  the 
social  science  area,  while  science  and  engineering  and  dormitory  areas  had 
the  highest  percentage  of  negative  postings.  The  number  of  negative  Tweets 
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increases  in  the  library  and  science  and  engineering  areas  as  the  end  of  the 
semester  approaches,  reaching  a peak  around  an  exam  period,  while  the  per- 
centage of  negative  Tweets  drops  at  the  end  of  the  semester  in  the  entertain- 
ment and  sport  and  dormitory  area.  This  study  will  provide  some  insights  into 
understanding  students  and  staff  s sentiment  variation  on  Twitter,  which  could 
be  useful  for  university  teaching  and  learning  management. 


Keywords 

spatial  and  temporal  analysis;  sentiment  analysis;  Twitter;  georeference 


Introduction 

Twitter  as  one  of  vital  platforms  for  people  to  publically  express  their  opinions 
and  feelings  about  events  and  their  private  lives,  has  attracted  enormous  atten- 
tion with  millions  of  followers  (Li  et  al.  2013).  Numerous  studies  have  been 
conducted  on  opinion  mining  and  sentiment  analysis  on  Twitter  (Pang  & Lee 
2008;  Poria  et  al.  2014;  Taboada  et  al.  2011;  Liu  2012).  The  sentiment  classi- 
fication methods  have  been  developed  from  simple  text  mining  to  advanced 
symbol  and  feature  recognition  (Liu  2012),  from  a pure  sentiment  analysis  to 
a sentiment  and  subjective  analysis  (Pang  & Lee  2004),  from  machine  learn- 
ing or  lexicon-based  approaches  to  more  advanced  hybrid  methods  (Serrano- 
Guerrero  et  al.  2015)  and  from  sentiment  orientation  with  only  two  directions 
(e.g.  positive  and  negative)  coarse  measurement  scale  to  a fine  grained  classifi- 
cation (Fink  et  al.  201 1).  However,  limited  sentiment  analysis  research  has  been 
conducted  from  a spatial  and  temporal  perspective.  This  study  tested  a research 
question  of  whether  sentiment  polarity  on  Twitter  exhibits  specific  time-loca- 
tion patterns. 

The  aim  of  this  paper  is  to  investigate  the  spatial  and  temporal  distribution 
of  georeferenced  Twitter  sentiment  polarity  within  Bentley  campus,  Cur- 
tin University  in  Perth  Western  Australia.  The  campus  was  divided  into  six 
zones  - science  and  engineering  buildings,  social  science  buildings,  library,  lec- 
ture theatre,  dormitory,  entertainment  and  parking  areas  and  four  periods  of 
time  - beginning  of  the  semester,  middle  of  the  semester,  end  of  semester  and 
after  examination  to  investigate  how  Twitter  sentiments  vary  across  different 
zones  and  time  periods.  The  Starlight  Visual  Information  System1  was  used  to 
conduct  a sentiment  analysis.  The  Feature  Manipulation  Engine  (FME2)  was 
employed  to  convert  non-spatial  files  into  spatial  and  temporal  feature  class. 


1 http://starlight.pnnl.gov. 

2 http://www.safe.com. 
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The  sentiment  polarity  patterns  across  six  spatial  zones  and  four  temporal 
zones  were  mapped  using  the  ArcGIS3  software. 

The  paper  is  structured  as  follows.  Section  2 presents  the  research  context 
including  a review  of  the  relevant  research  literature  associated  with  methods 
for  a Twitter  sentiment  analysis.  The  research  method  in  Section  3 describes 
the  overall  approach  taken  in  this  research.  Section  4 presents  the  results,  and 
the  paper  concludes  with  a discussion  of  key  findings  and  implications  of  our 
observations  on  different  Twitter  topics  in  Section  5. 


Related  work 

Sentiment  analysis  is  the  Natural  Language  Processing  work,  which  involves 
opinion  detection  and  classification  of  attitudes  in  texts  (Balahur  et  al.  2014). 
Numerous  studies  have  been  conducted  for  automatically  detecting  opinions 
and  emotions.  This  section  summarised  these  studies  into  two  categories. 


Sentiment  classification  trends 

The  early  studies  of  sentiment  classification  are  mostly  based  on  text  mining 
techniques.  Opinion  was  classified  into  positive/negative  or  positive/negative/ 
neutral.  It  can  be  simple  two,  five  or  even  eleven  point  scale  depending  on  the 
complexity  of  a task  (Taboada  et  al.  2011;  Pang  & Lee  2008;  Pang  & Lee  2004; 
Whitelaw  et  al.  2005).  Human  language  tends  to  be  subjective.  The  same  sen- 
tence in  different  tones  or  contexts  could  in  different  emotional  states.  It  creates 
a great  challenge  to  identify  the  affective  state  or  intended  emotional  commu- 
nication (Sarvabhotla  et  al.  2011;  Pang  & Lee  2004;  Wilson  et  al.  2005).  There- 
fore, subjectivity  analysis  can  go  beyond  simple  category  of  positive,  negative 
or  neutral  (Liu  2012).  Some  studies  focus  on  detecting  ironic  and  sarcastic  con- 
tent of  texts.  However,  there  is  a huge  debate  of  how  to  formally  define  irony 
and  sarcasm,  which  add  another  dimension  of  subjectivity  analysis  (Reyes 
& Rosso  2012).  Except  extracting  polarity  of  a given  text,  more  fine-grained 
methods  have  been  developed  to  detect  emotion  or  opinions  from  symbols, 
such  as  emoticons  (e.g.  ‘O’,  ‘©’)  (Read  2005;  Go  et  al.  2009)  and  visual  features 
or  images  (Liu  2012).  This  progression  of  the  studies  has  taken  the  sentiment 
analysis  research  to  a new  level. 


Classification  methods 

In  order  to  perform  different  sentiment  classification  tasks,  various  sentiment 
algorithms  were  developed  (Medhat  et  al.  2014;  Serrano-Guerrero  et  al.  2015). 


3 http : / / www.  esri.  com/ software/ arcgis . 
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Medhat  et  al.  (2014)  grouped  the  SA  into  two  categories:  machine  learning  and 
lexicon-based  approaches  (see  Figure  1).  Generally,  machine-learning  methods 
were  used  to  automatically  discover  sentiment  polarity  pattern  rules  in  large 
data  in  order  to  learn  opinions  or  emotions  of  given  texts  or  features.  A variety  of 
algorithms  have  been  developed  (Ye  et  al.  2009;  Rushdi  Saleh  et  al.  2011).  Most 
algorithms  fall  into  the  category  of  supervised  machine  learning.  For  example, 
Rushdi  Saleh  et  al.  (2011)  applied  Support  Vector  Machines  (SVM)  to  detect 
whether  the  opinion  expressed  in  a document  is  positive  or  negative  about  a 
given  topic  using  several  weighting  schemes.  Ye  et  al.  (2009)  compared  Naive 
Bayes,  SVM  and  the  character  based  N-gram  model  for  analysing  the  sentiment 
of  travel  blogs  for  seven  popular  travel  destinations  in  US  and  Europe.  The  SVM 
and  N-gram  approach  was  found  to  outperform  the  Naive  Bayes  approach  with 
accuracies  reaching  to  at  least  80%.  Balahur  (2013)  developed  an  unsupervised 
method  especially  for  a Twitter  data  sentiment  analysis  using  the  SVM.  The 
major  contribution  of  the  study  is  to  employ  methods  in  normalising  Tweet 
language,  including  higher  order  n-grams  to  spot  modifications  in  sentiment 
polarity  articulated  and  selecting  features  using  simple  heuristics. 

Lexicon-based  approaches  focus  on  measuring  subjectivity  and  opinions 
in  texts  using  Semantic  orientation  (SO)  (Osgood  et  al.  1957),  which  capture 
orientations  of  opinions  (positive  or  negative)  and  strengths  or  degrees  of  ori- 
entation (Taboada  et  al.  2011).  Sentiment  lexicons  are  the  key  for  this  type 


Sentiment  Analysis  — 


Machine  Learning 
Approach 


Supervised  Learning  — 


Decision  Tree 
Classifiers 


Linear 

Classifiers 


Support  Vector  Machines 


Rule -based 
Classifiers 


Probabilistic 

Classifiers 


Unsupervised  Learning 


Lexicon -based 
Approach 


Dictionary-based 

Approach 


Statistical 

Corpus-based 

Approach 

Semantic 

Neural  Network 


Naive  Bayes 


Bayesian  Network 


Maximum  Entropy 


Figure  1:  Sentiment  classification  techniques  (Source:  Medhat  et  al.  2014). 
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of  methods.  For  example,  Paltoglou  and  Thelwall  (2012)  proposed  a lexicon- 
based  approach  to  identify  whether  a text  conveys  negative  or  positive  attitudes 
and  to  estimate  the  level  of  emotional  intensity  of  a text  in  social  media  and 
microblogging  environments.  They  added  extensive  linguistically  function- 
alities (negation/capitalization  detection,  intensifier/diminisher  detection  and 
emoticon/ exclamation  detection)  to  the  traditional  classifiers  such  as  Support 
Vector  Machines  (SVM),  Maximum  Entropy  classifier  and  Naive  Bayes  clas- 
sifiers. Khan  et  al.  (2014)  utilised  a hybrid  system  framework,  which  contains 
unsupervised  learning  algorithms  and  a dictionary-based  method  named 
Twitter  Opinion  Mining  Framework  (TOM).  This  method  applied  a variety  of 
techniques  for  Twitter  analysis  and  classification  including  a hybrid  scheme  of 
Enhanced  Emoticon  Classifier  (EEC),  SentiWordNet  Classifier  (SWNC)  and 
an  improved  polarity  classifier  (IPC)  using  a list  of  positive/negative  words. 
The  findings  reveal  that  the  proposed  algorithm  resolved  previous  technical 
issues  and  increased  the  classification  accuracy,  effectively  reduced  the  number 
of  classified  neutrals.  Dacres  et  al.  (2013)  conducted  a topic  analysis  and  sen- 
timental analysis  in  understanding  the  contents  of  Tweets  and  trends  posted 
over  a 10-day  period  using  machine  learning  and  natural  language  processing 
techniques.  The  researchers  examined  and  compared  the  commenly-used  Data 
Science  Toolkit’s  text2sentiment4,  which  is  based  on  different  methods,  such 
as  sentiment  lexicon  (Nielsen  2011),  the  lexicon-based  but  data-driven  hybrid 
SentiStrength  (Thelwall  et  al.  2012),  and  Charrerbox’s  Sentimental  API  (Purver 
& Battersby  2012).  In  Dacres  et  al.  (2013)  analysis,  best  result  was  achieved  by 
the  machine  learning  method  (Charrerbox’s  Sentimental  API)  with  84%  accu- 
racy. In  our  study,  we  have  also  adopted  a machine  learning  method  using  Sub- 
space  Transformation  (TRUST)  engine  in  Starlight  for  vector  space  modelling 
and  supervised  learning  for  a sentiment  analysis. 


Methods 

Study  area  and  data  collection  methods 

Study  Area 

The  developed  content  analysis  methods  were  implemented  using  Tweets 
within  the  area  of  1 km  buffer  around  Curtin  Bentley  campus  boundary  (see 
Figure  2).  Curtin  University  is  one  of  the  largest  universities  in  Australia.  It 
has  more  60,000  students  enrolled  each  year  (OFFICE  OF  STRATEGY  AND 
PLANNING  2014).  The  Bentley  campus,  as  the  main  campus  of  Curtin 
University,  is  located  about  six  km  southeast  from  the  Perth  CBD.  It  covers 
116  hectares  with  a variety  of  facilities,  such  as  a library,  lecture  theatres, 
teaching  rooms,  cafes,  dormitories  and  parking  areas  (Curtin  University  2015). 
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Figure  2:  Curtin  University  Area  Map  (Map  data  © OpenStreetMap  contributors). 


Data  Gathering  And  Pre-processing 

Because  the  objective  of  this  paper  is  to  understand  spatial  and  temporal  pat- 
terns of  georeferenced  Tweets  emotional  polarity,  we  only  used  geotagged 
Tweets  posted  between  12  May  2014  and  5 Jan  2015  within  the  area  of  1 km 
buffer  around  Curtin  Bentley  campus  boundary  as  JSON  files  via  Twitter  API 
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and  converted  to  XML  format  using  the  Ruby  built-in  JSON  library.  More  than 
5,000  Tweets  were  downloaded  from  Twitter  API  during  the  study  period  but 
around  one  third  of  them  were  located  outside  the  Curtin  University  boundary 
After  removing  the  ones  outside  Curtin  University  boundary,  there  were  a total 
of  3,172  Tweets  gathered.  These  XML  files  of  Tweets  consist  of  both  geo -location 
information  and  attribute  information  such  as  text  of  Tweets,  time  created  at  and 
geolocation,  which  are  directly  relevant  to  our  research.  Tweets  of  non-English 
languages  were  removed  for  this  study  in  order  to  simplify  the  analysis,  leaving 
3,097  Tweets  in  the  dataset.  In  our  study,  location  associated  with  each  Tweet 
is  in  the  format  of  (x,  y)  coordinates,  which  were  automatically  captured  using 
built-in  Global  Positioning  System  (GPS)  receivers  in  mobile  or  tablet  applica- 
tions, such  as  a smart  phone.  This  is  the  exact  location  where  a user  Tweeted. 
The  accuracy  of  location  recorded  by  GPS  is  usually  around  a few  meters.  How- 
ever, if  the  location  captured  by  triangulation  outside  a cellular  network,  the 
accuracy  can  range  between  30-3000  m,  subjecting  to  the  cell  distribution  (Li 
et  al.,  2013).  These  geotagged  Twitter  data  were  imported  into  a geodatabase 
using  GIS  software,  such  as,  FME  and  ArcGIS,  as  point  features,  Curtin  Build- 
ings and  boundary  geography  files  were  extracted  from  Open  Street  Map4  and 
converted  into  the  same  geodatabase  using  the  ArcGIS  software. 


Sentiment  analysis  methods 

There  are  three  levels  of  sentimental  analysis  -1)  document  level  sentiment 
classification;  2)  sentence  level  sentiment  classification;  and  3)  aspect  level  sen- 
timent analysis  (Liu  2012).  This  research  chose  the  sentence  level  sentiment 
classification.  The  sentence  level  sentiment  classification  is  suitable  because  of 
its  assumption  that  each  sentence  contains  only  one  entity  or  one  aspect  of 
entity  in  many  cases  (Liu  2012).  Each  Tweet  has  a limitation  of  140  characters 
in  order  to  ensure  that  information  posted  on  Twitter  is  straight  forward  to 
the  theme.  As  a result,  it  is  more  suitable  to  perform  a sentence  level  senti- 
ment analysis  on  Tweets.  This  study  carried  out  a sentiment  analysis  by  using 
the  Starlights  Sentiment  Analyzer  function  in  Starlight  Data  Engineer  (SDE), 
which  adopted  the  Boeing  Text  Representation  using  Subspace  Transformation 
(TRUST)  engine  for  vector  space  modelling  and  text  summarisation  (Simoff 
et  al.  2008).  The  input  for  Sentiment  Analyzer  is  XML  files.  Sentiment  Ana- 
lyzer analyses  individual  words  of  a Tweet  and  calculates  a score  of  sentiment 
orientation.  It  returns  statistics  from  the  sentiment  analysis,  such  as,  senti- 
mentTotal,  sentimentDiff,  sentimentScore,  wordCount,  sentimentNegative, 
sentimentPositive.  In  this  study,  sentimentDiff  is  the  sentiment  orientation  of 
Tweets.  SentimentDiff  is  the  result  of  sentimentPositive  subtracting  sentiment- 
Negative (Liu  2012).  If  sentimentDiff  is  positive  number,  it  means  this  text  hold 


4 http://www.openstreetmap.org 
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a positive  attitude.  If  sentimentDiff  is  negative,  it  means  the  text  expresses  a 
negative  opinion.  These  simple  statistics  returned  from  the  Sentiment  Analyzer 
are  defined  as  the  followings: 

• Sentiment  Total:  the  sum  of  sentimentPositive  and  sentimentNegative. 

• SentimentDiff:  sentimentPositive  subtract  sentimentNegative. 

• SentimentScore:sentimentDiff  divided  by  WordCount. 

• WordCount:  the  total  number  of  words  in  the  text. 

• SentimentNegative:  each  negative  term  in  the  text  represent  by  1.  Senti- 
mentNegative is  the  total  number  of  negative  words  in  the  text.  For  exam- 
ple, text  such  a bad  day!  Everything  is  wrong’.  ‘Bad’  and  ‘wrong’  makes  the 
sentimentNegative  to  be  2. 

• SentimentPositive:  each  positive  term  in  the  text  represents  by  1.  Senti- 
mentPositive is  the  total  number  of  positive  words  in  the  text.  For  instance, 
sentence  ‘study  hard  and  the  review  is  quite  efficient’,  ‘hard’  and  ‘efficient’ 
makes  the  sentimentPositive  score  to  be  2. 

From  the  analysis  above,  the  range  of  sentiment  orientation  scores  were  derived 
from  this  study  from  5 to  -6.  The  sentiment  category  is  shown  in  Table  1. 
The  scale  we  used  does  not  consider  severity  of  individual  words,  but  their 
frequency. 

Sentiment  orientation  scores  ranging  from  4 to  5 represent  Tweets  expressing 
very  positive  sentiment  and  sentiment  orientation  scores  ranging  from  1 to  3 
mean  Tweets  holding  positive  sentiments.  Sentiment  orientation  score  0 means 
Tweets  do  not  express  any  opinions.  Sentiment  orientation  scores  from  -4  to  -6 
represent  Tweets  holding  very  negative  sentiment  while  sentiment  orientation 
scores  ranging  from  -1  to  -3  mean  negative  attitudes. 


Spatial  and  temporal  comparison  of  Twitter  sentiment 
polarity  patterns 

We  divided  Curtin  Bentley  University  campus  into  six  spatial  zones  and  four 
time  periods  (see  Table  2).  The  spatial  distribution  of  georeferenced  Tweets 


Sentiment  Category 

Sentiment  orientation  scores 

Very  Positive 

4-5 

Positive 

1-3 

Natural 

0 

Negative 

(—!)-(— 3) 

Very  Negative 

( — 4)— ( — 6) 

Table  1:  Sentiment  category. 
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Study  Period  Category 

Time  Span 

Beginning  of  the  semester 

28  July  2014-7  September  2014 

Middle  of  the  semester 

6 May  2014  - 18  May  2014  and  8 September  2014  - 
19  October  2014 

End  of  the  semester 

19  May  2014  - 27  June  2014  and  20  October  2014  - 
28  November  2014 

After  examination 

28  June  2014  - 27  July  2014  and  29  November  2014  - 
5 January  2015 

Table  2:  Study  period  category 


among  six  zones  was  derived  using  ArcGIS  software.  In  addition,  the  ArcGIS 
software  was  used  to  assign  all  Tweets  with  time  span  information  and  exported 
to  an  excel  spread  sheet  for  further  analyses. 


Research  Hypothesis 


• Spatial  difference 

Tweets  posted  in  different  campus  locations  are  various  in  sentiments.  For 
example,  Tweets  posted  in  Entertainment  and  parking  area  may  be  more 
positive  while  Tweets  posted  in  study  areas  may  be  more  negative. 

• Temporal  difference 

Adnan  et  al.  (2012)  conducted  research  on  student  stress  levels  at  the 
beginning  and  the  end  of  the  semester  and  found  the  stress  level  of  uni- 
versity students  varies  throughout  a semester.  Generally  speaking,  students 
felt  more  stressful  at  the  end  of  the  semester  than  at  the  beginning  of  the 
semester.  Based  on  this  study,  we  proposed  a hypothesis  that  more  positive 
Tweets  are  posted  at  the  beginning  of  semester  and  after  examination  com- 
pared with  the  ones  posted  at  the  middle  of  the  semester  and  at  the  end  of 
the  semester. 


Results 

Overall  distribution  of  the  Tweets  by  sentiments 

Three  thousand  and  ninety- seven  Tweets  were  loaded  into  the  Starlight  Data 
Engineer  for  a sentiment  analysis  and  output  Tweets  were  assigned  sentiment 
polarity.  Then  output  Tweets  from  Starlight  were  further  processed  in  FME  to 
be  converted  into  feature  classes.  Table  3 illustrates  the  overall  distribution  of 
the  Tweets  by  sentiments.  Around  45%  of  Tweets  contain  neutral  opinions,  such 
as  ‘I’m  at  Curtin  University’.  Besides,  the  number  of  Tweets  holding  positive 
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Sentiments 

Score  range 

Number  of  Tweets 

Very  positive 

4-6 

32 

Positive 

1-3 

1,091 

Neutral 

0 

1,383 

Negative 

(—!)-(— 3) 

577 

Very  Negative 

( — 4)— ( — 6) 

14 

Table  3:  Overall  distributions. 

opinions  is  nearly  as  twice  as  Tweets  of  negative  opinions.  Fourteen  Tweets  con- 
tain very  negative  feelings  while  thirty-two  Tweets  have  very  positive  opinions. 

Spatial  distribution  of  sentiment  polarity  patterns 

As  it  is  showed  in  Figure  3 and  Table  4,  most  Tweets  were  posted  in  the  enter- 
tainment and  parking  and  dormitory  area.  Tweets  posted  in  the  science  and 
engineering  area  do  not  have  very  negative  opinions,  while  Tweets  posted  in 
lecture  theatre  and  library  do  not  contain  very  positive  opinions.  Tweets  posted 
in  social  science  areas  have  the  highest  percentage  of  positive  Tweets  (42.1%), 
while  Tweets  gathered  in  the  library  area  have  the  largest  percentage  of  negative 
opinions  (21.8%  of  total  Tweets  in  the  area).  About  20.7%  of  total  Tweets  posted 
in  science  and  engineering  area  hold  negative  opinions,  while  18%  of  Tweets  in 
social  science  are  negative.  Therefore  the  descriptive  data  potentially  indicates 
that  students  or  staff  in  the  science  and  engineering  areas  could  feel  slightly 
more  negative  compared  with  students  or  staff  in  the  social  science  areas. 

Temporal  distribution  of  sentiment  polarity  patterns 

Sentiment  orientation  at  different  study  time  zones  is  showed  in  Table  5.  It  is 
interesting  to  note  that  the  largest  percentage  of  negative  feeling  actually  occur 
at  the  beginning  of  semester,  which  is  21.6%.  This  is  different  from  our  hypoth- 
esis. The  percentage  of  negative  sentiments  decreases  from  the  beginning  of  the 
semester  zone  to  after  exam  zone,  reaching  to  smallest  number  of  negative  opin- 
ions (0.08)  after  examination.  The  percentage  of  very  negative  Tweets  is  roughly 
the  same  throughout  all  four  study  zones.  After  examination  time  zone  holds  the 
largest  percentage  of  positive  Tweets,  which  is  align  with  the  research  hypothesis. 

The  library  area  contains  a large  cluster  of  Tweets  (See  Figure  4),  which  shows 
an  interesting  temporal  pattern.  At  the  beginning  of  the  semester,  still  more 
positive  Tweets  occurred  than  negative  ones  in  the  library  area.  However,  as  a 
semester  goes,  more  negative  Tweets  were  posted  in  the  library  area.  The  per- 
centage of  positive  Tweets  in  each  temporal  zone  decreased  over  time  gradually. 
We  also  summarised  the  number  of  negative  opinions  on  Twitter  across  five 
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■ Curtin  tweets 
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Sentiments  At  Spatial  Zones 

N 

A 2i>  

Figure  3:  Sentiments  at  spatial  zones. 


Spatial  zones 

Very 

Positive 

Positive 

Neutral 

Negative 

Very 

Negative 

Total 

Science  and 
Engineering 

5 (2.7%) 

59  (31.4%) 

85  (45.2%) 

39  (20.7%) 

0 (0%) 

188 

Social  Science 

3 (1.1%) 

110(42.1%) 

99  (37.9) 

47(18%) 

2 (0.8%) 

261 

Library 

0 (0%) 

85  (32.6%) 

118(45.2%) 

57  (21.8%) 

1 (0.4%) 

261 

Lecture  Theatre 

0 (0%) 

22  (40.0%) 

27  (49.1%) 

4 (7.3%) 

2 (3.6%) 

55 

Dormitory 

5 (1.6%) 

260  (32.1%) 

387  (47.8%) 

156(19.3%) 

2 (0.2%) 

810 

Entertainment 
and  Parking 

19  (1.2%) 

555  (36.4%) 

667  (43.8%) 

275  (18.1%) 

7 (0.5%) 

1523 

Table  4:  Sentiments  for  spatial  zones. 

*The  percentage  of  very  positive  Tweets  in  the  Science  and  Engineering  area. 
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TimeZones 

Very 

Positive 

Positive 

Neutral 

Negative 

Very 

Negative 

Total 

Beginning  of 
semester 

12  (1.5%) 

275  (35.6%) 

316(40.9%) 

167  (21.6%) 

3 (0.4%) 

773 

Middle  of 

semester 

11  (0.9%) 

407  (33.6%) 

542  (44.8%) 

244  (20.2%) 

6 (0.5%) 

1210 

End  of  semester 

7 (0.8%) 

334  (36.0%) 

431  (46.5%) 

151  (16.3%) 

4 (0.4%) 

927 

After  exam 

2(1.1%) 

76  (40.4%) 

94  (50.0%) 

15  (8.0%) 

1 (0.5%) 

188 

Table  5:  Sentiment  polarity  of  temporal  zones. 


Time  Zones 

Very 

Positive 

Positive 

Neutral 

Negative 

Very 

Negative 

Total 

Beginning  of  the 
semester 

0 

12  (35%*) 

15  (44%) 

7 (21%) 

0 

34 

Middle  of  the 

semester 

0 

34  (33%) 

47  (46%) 

22  (21%) 

0 

103 

End  of  the  semester 

0 

38  (31%) 

56  (46%) 

28  (23%) 

1 (1%) 

123 

After  examination 

0 

1 (100%) 

0 

0 

0 

1 

Table  6:  Sentiment  polarity  of  temporal  zones  in  the  library  area. 
*The  percentage  of  positive  Tweets  in  the  beginning  of  semester 


Time 

Zones 

Dormitory 

Science  and 
Engineering 

Entertainment 
and  Parking 

Lecture 

Theatre 

Library 

Social 

Science 

Total 

Beginning 
the  of 

semester 

49 

11 

86 

1 

7 

13 

167 

Middle 
of  the 

semester 

72 

13 

116 

3 

22 

19 

245 

End  of  the 

semester 

31 

14 

63 

0 

28 

15 

151 

After 

examination 

4 

1 

10 

0 

0 

0 

15 

Table  7:  The  number  of  negative  opinions  on  Twitter  over  space  and  time. 


spatial  zones  and  four  temporal  zones  (see  Table  7).  Interestingly,  except  the 
science  and  engineering  area,  the  number  of  negative  opinions  dropped  at  the 
end  of  semester  in  the  other  three  areas.  Certain  temporal  patterns  of  Tweeter 
polarity  only  occurred  at  specific  locations.  This  may  indicate  that  Twitter 
sentiments  are  time-location  specific  and  they  might  depend  on  the  activities 
conducted  at  certain  locations. 
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Figure  4:  The  distribution  of  sentiment  polarity  over  time. 


Concluding  remarks 

This  study  presents  methods  for  analysing  spatial  and  temporal  patterns  of 
Twitter  sentiment  polarity  at  Curtin  University.  By  using  this  case  study,  we 
hope  the  results  can  provide  some  insights  into  understanding  the  spatial  and 
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temporal  variation  of  students  and  staffs  sentiment  polarity.  We  separated 
Tweets  into  five  different  spatial  zones  and  four  time  zones  and  tested  a research 
question  of  whether  sentiment  polarity  on  Twitter  exhibits  specific  time-loca- 
tion patterns. 

Interestingly,  although  the  number  of  positive  Tweets  posted  in  the  enter- 
tainment and  parking  area  were  larger  than  the  ones  posted  in  the  study  areas, 
the  highest  percentage  of  positive  Tweets  were  found  in  social  science  with  over 
40%.  Besides,  science  and  engineering  and  dormitory  areas  had  the  highest 
percentage  of  negative  postings  (both  over  19%).  It  indicates  that  the  sample 
of  students  in  social  science  students  tended  to  post  twitter  messages  contain- 
ing a higher  ratio  of  positive-to-negative  words  than  the  sample  of  students  in 
science  and  engineering  in  Curtin  university.  In  addition,  a trend  of  increasing 
number  of  negative  Tweets  was  identified  in  library  and  science  and  engineer- 
ing areas  as  a semester  goes  from  the  beginning  to  the  end,  reaching  a peak 
around  the  exam  period.  While  for  the  entertainment,  sport  and  dormitory 
area,  the  percentage  of  negative  Tweets  dropped  at  the  end  of  semester.  This 
could  mean  that  negative  feelings  might  be  associated  with  exam  induced  stress 
and  study  workload  (Adnan  et  al.  2012). 

The  spatial  and  temporal  sentiment  analysis  used  geotagged  Twitter  data, 
which  allow  a sentiment  polarity  analysis  at  a fine-grained  level.  This  method 
can  be  applied  in  many  areas,  such  as  polls  (Wang  et  al.  2012a),  consumer 
opinions  concerning  brands  (Jansen  et  al.  2009),  stock  market  performance 
(Bollen  et  al.  2011),  crime  prediction  (Wang  et  al.  2012b)  and  tourism  infor- 
mation (Shimada  et  al.  2011).  However,  there  are  a few  limitations  to  be  con- 
sidered when  making  conclusions  from  this  study.  For  example,  outputs  from 
the  sentiment  analysis  in  Starlight  Data  engineer  are  not  perfectly  accurate.  For 
example,  emotion  tokens  cannot  be  analysed  in  the  sentiment  analysis.  Some 
emotion  tokens,  such  as  ‘©’  and  ‘(g)’,  actually  express  very  obvious  attitudes. 
But  our  methods  in  the  sentiment  analysis  cannot  process  them.  Besides,  our 
sentiment  analysis  methods  cannot  handle  sarcasm  properly.  For  example,  a 
Tweet  of  ‘Well  done.  You  have  forgotten  your  umbrella  expresses  a negative 
feeling,  but  the  sentiment  analysis  misclassified  it  as  positive.  Some  Tweets  do 
not  have  sentiment  words  but  they  imply  some  emotions,  which  the  sentiment 
analysis  could  misclassify  them.  For  instance,  ‘Sleeping  pattern  is  sooo  screwed 
up,  no  more  4am’  appears  to  be  a negative  sentiment,  but  the  sentiment  analy- 
sis classified  it  as  neutral.  In  the  future,  this  study  will  adopt  more  advanced 
algorithms,  such  as  methods  developed  by  Liu  (2012),  Katz  et  al.  (2015)  and 
Poria  et  al.  (2014)  for  the  sentiment  analysis.  In  addition,  in  this  study,  we 
tested  our  hypotheses  in  only  one  university.  We  will  collect  Twitter  data  of 
major  universities  in  Western  Australia  and  conduct  a comparison  study  in 
the  future.  In  addition,  we  will  develop  a better  measure  of  sentiment  polarity, 
which  will  take  both  severity  and  frequency  of  individual  words  into  account 
in  the  future. 
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Abstract 

The  enormous  amount  of  data  generated  on  social  media  provides  vast  quanti- 
ties of  geo-referenced  data.  Volunteered  Geographic  Information  (VGI)  origi- 
nating from  social  networks  has  produced  new  challenges  for  research  and  has 
opened  opportunities  for  a wide  range  of  use  cases.  Smartphones  with  built-in 
GPS  sensors  enabled  users  to  easily  share  their  location  and  with  the  growing 
number  of  such  devices  available,  VGI  data  is  expanding  at  a rapid  rate.  Twit- 
ter is  one  of  the  most  popular  microblogging  services.  Its  a social  network  that 
enables  access  to  the  data  that  is  being  created  on  the  platform.  It  also  allows  for 
real-time  retrieval  of  data  from  a given  geographic  area. 

In  this  paper  we  give  an  overview  of  a system  for  detecting  and  identifying 
social  hotspots  from  Twitter  stream  data  and  applying  sentiment  analysis  on 
the  data.  Utilizing  the  Twitter  Streaming  Application  Programming  Interface 
(API),  we  collected  a significant  number  of  Tweets  from  New  York  and  we 
evaluated  the  quality  of  the  retrieved  data.  In  this  paper,  we  outline  advantages 
and  disadvantages  of  using  various  clustering  algorithms  over  the  data  for  this 
purpose,  namely  hierarchical  agglomerative  clustering  and  DBSCAN.  We  also 
elaborate  on  techniques  for  identifying  social  hotspots  from  spatially  localized 
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clusters.  Finally,  we  present  a deep  learning  approach  to  sentiment  analysis 
used  to  determine  the  attitude  of  users  participating  in  the  identified  social 
hotspots. 
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Introduction 

Volunteered  Geographic  Information  (VGI)  as  coined  by  Goodchild  (2007) 
is  defined  as  the  harnessing  of  tools  to  create,  assemble  and  disseminate  geo- 
graphic data  provided  voluntary  by  individuals.  VGI  received  a lot  of  traction 
in  recent  years  with  the  continuing  evolution  of  Web  technologies  that  provide 
for  easier  user  participation  in  the  creation  of  such  data  with  users  assuming 
a more  active  role  in  the  creation  of  VGI  data  (Sui  2011).  This  idea  stimulated 
the  popularity  of  a number  of  services  such  as  Wikimapia,  OpenStreetMap 
and  many  others.  Wikimapia  is  based  on  the  same  concept  as  Wikipedia  and 
it  contains  over  23.8  million  objects.  Flickr  on  the  other  hand,  a system  for 
hosting  images  allows  users  to  upload  geo-tagged  photos  on  the  platform  with 
the  corresponding  latitude  and  longitude  pair  that  are  associated  with  the  pic- 
ture. Alternatively,  people  can  also  act  as  sensors  and  just  provide  their  location 
through  the  use  of  various  social  network  services  such  as  Facebook,  Twitter, 
Foursquare  etc. 

Taking  advantage  of  VGI  or  otherwise  known  as  User  Generated  Spatial 
Content  (UGSC)  leads  to  a vast  number  of  applications  ranging  from  public 
health,  early  disaster  warning  and  crisis  management  to  various  types  of  ana- 
lytics useful  to  marketing  agencies  and  companies. 

Social  networks  and  microblogging  platforms  have  attracted  the  attention  of 
users  on  a huge  scale  over  the  recent  years.  Platforms  such  as  Facebook,  Ins- 
tagram,  Foursquare  and  many  others  generate  massive  amounts  of  data.  These 
services  also  allow  users  to  geo-locate  the  information  they  share  on  these  net- 
works. This  lead  to  popularization  of  Social  Media  Geographic  Information 
(SMGI)  which  refers  to  the  geo -referencing  of  multimedia  data  extracted  by 
social  media  applications.  With  the  spread  of  mobile  devices  equipped  with 
GPS  sensors,  it  become  even  more  accessible  for  users  to  share  their  location 
in  order  to  provide  more  context  to  the  content  they  are  sharing.  As  of  2015, 
Twitter,  the  most  popular  microblogging  platform  of  all,  has  over  300  million 
monthly  active  users  and  generates  over  500  million  messages  on  a daily  basis1. 
One  key  advantage  of  Twitter  is  its  real-time  component,  positioning  the  plat- 
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form  as  one  of  the  most  up  to  date  data  source,  as  witnessed  by  its  ability  to 
break  news  before  other  sources. 

Bloggers  in  the  Twitter  community  use  the  platform  to  express  their  views 
and  ideas  on  different  topics,  share  thoughts  on  their  daily  activities,  celeb- 
rity gossip  etc.  Although  only  a small  percentage  of  Tweets  (1.2%)  (Dredze 
2013)  contain  exact  location,  the  sheer  volume  of  messages  generated  every 
day,  makes  Twitter  a gold  mine  for  mining  VGI  data.  Since  users  Tweet  about 
events  around  them  in  real-time,  we  could  tap  into  this  information  stream  and 
identify  social  hotspots  as  they  are  emerging.  Furthermore,  applying  sentiment 
analysis  on  the  content  shared  related  to  a social  hotspot,  can  provide  addi- 
tional insight  about  the  place  or  event. 

Sentiment  analysis  on  social  media  has  wide  applications  as  it  can  be  used 
to  provide  feedback  for  the  reaction  products  and  services  receive,  the  public 
opinion  towards  different  candidates  during  political  elections  etc.  The  pre- 
sented work,  focuses  on  analysis  of  sentiment  related  to  social  hotspots. 


Related  Work 

A lot  of  research  has  been  conducted  to  explore  the  various  applications  of  vol- 
unteered geographic  data  from  social  media  (Sui  2011).  Companies  have  also 
showed  interest  in  the  area  along  with  the  field  of  sentiment  analysis  because  of 
its  potential  to  provide  valuable  insight  into  peoples  reaction  regarding  related 
products  and  the  distribution  over  geographical  areas  (Liu  2014). 

Dredze  et  al.  (2013)  explored  the  application  of  Twitter  geo-located  data  to 
public  health.  They  developed  a system  that  infers  structured  location  informa- 
tion from  Tweets  and  showed  how  this  information  can  be  used  for  influenza 
tracking.  However,  their  approach  only  detects  location  on  city  level  and  it’s  not 
able  to  detect  finer  grained  locations.  In  the  work  of  Li  (2013),  he  addressed  the 
issues  of  extracting  local  information  and  discovering  communities  of  interest 
in  local  social  media.  Bosch  et  al.  (2013)  propose  a visual  analytics  approach  to 
facilitate  sensemaking  of  geo-located  microblog  posts  by  enabling  analysts  to 
create  automatic  methods  for  extracting  messages  and  by  applying  those  meth- 
ods when  monitoring  topics  of  interest. 

Twitter  as  a source  of  geo-data  has  been  used  in  various  domains,  many  of 
which  focus  on  event  detection  from  Twitter  data.  Abdelhaq  et  al.  (2013)  pre- 
sented a framework  for  detection  of  localized  events  in  real-time  from  Twitter 
streams  and  tracking  their  evolution  over  time.  The  proposed  system  uses  both 
geo-located  and  non-geo-located  Tweets  to  identify  event  describing  words, 
but  only  geo -located  Tweets  are  used  to  determine  the  spatial  distribution  of 
such  words.  Spatial  and  temporal  characteristics  of  keywords  are  continuously 
extracted  to  identify  meaningful  candidates  for  event  descriptions.  The  sys- 
tem selects  words  that  show  bursty  frequency  in  the  current  time  frame  and 
have  local  spatial  distribution.  Keywords  are  then  clustered  by  their  spatial 


226  European  Handbook  of  Crowdsourced  Geographic  Information 


signatures  and  clusters  are  scored  to  show  how  likely  is  that  they  represent  a 
localized  event.  However,  this  approach  does  not  perform  very  well  in  situa- 
tions when  there  are  multiple  geo-terms  within  the  same  text. 

Kisilevich  et  al.  (2010)  developed  a new  version  of  the  DBSCAN  clustering 
algorithm  for  analysis  of  places  and  events  using  a collection  of  geo-tagged  pho- 
tos. The  assumption  is  that  a high  photo  activity  in  a specific  area  is  indicative 
of  an  interesting  place  or  an  ongoing  event.  The  proposed  algorithm  addresses 
an  issue  that  is  specific  to  identifying  social  hotspots,  which  occurs  when  a 
significant  portion  of  the  samples  in  a cluster  originates  from  a single  user. 
In  our  work,  we  also  utilize  the  DBSCAN  clustering  algorithm  and  tackle  this 
issue  in  the  cluster  detection  phase  where  features  are  generated  indicative  of 
the  number  of  users  in  a cluster.  Evaluation  of  the  approach  is  done  on  Flickr 
images  from  Washington,  D.C. 

Walther  et  al.  (2013)  propose  a system  that  detects  geo-spatial  events  from 
the  Twitter  stream.  The  proposed  system  create  clusters  or  EventCandidates  in 
a manner  similar  to  the  DBSCAN  clustering  algorithm.  Clusters  are  further 
analyzed  to  detect  if  any  overlaps  have  occurred,  both  spatially  and  temporally. 
Several  hand-crafted  features  were  developed  that  the  authors  consider  to  be 
indicative  of  an  actual  event  and  these  features  are  extracted  from  each  cluster. 
Finally,  an  evaluation  whether  a cluster  represents  an  actual  event  or  not  is 
made  using  a machine  learning  approach.  Our  system  builds  on  the  work  of 
Walther  et  al.  (2013).  Additionally,  we  extend  the  system  by  applying  sentiment 
analysis  on  the  identified  social  hotspots. 


Data  Retrieval 

Retrieving  Twitter  messages  is  available  through  the  Twitter  Application  Pro- 
gram Interface  (API)  which  offers  a variety  of  REST  endpoints.  Nonetheless,  in 
order  to  continuously  collect  Tweets  from  a certain  geographic  location,  gener- 
ating repeated  REST  calls  is  infeasible  due  to  Twitter  rate  limits.  As  a result,  we 
must  utilize  the  Twitter  Streaming  API  that  gives  low  latency  access  to  Twitter  s 
global  stream  of  data.  The  stream  can  return  Tweets  originating  from  a set  of 
users  or  messages  that  contain  certain  keywords.  However,  the  possibility  of 
supplying  the  Streaming  API  with  a filter  specified  by  a set  of  spatial  bound- 
ing boxes  defined  by  latitude  and  longitude  pairs  is  of  interest  in  our  work. 
This  filter  provides  real-time  access  to  all  messages  originating  from  a given 
geographic  area. 

A Tweet  and  its  location  are  available  through  the  Twitter  public  streams 
if  the  user  explicitly  consents  to  sharing  the  location  in  the  post.  Users  can 
enable  locations  on  their  devices  and  provide  exact  coordinates  obtained  from 
a GPS  sensor  of  the  device  they  used  to  post  the  message.  Additionally,  Twitter 
enables  for  manual  embedding  of  places  in  messages.  Places  on  Twitter  have 
specific  IDs,  defined  by  a bounding  box  and  can  be  of  several  different  types 
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(city,  POI,  country).  For  a Tweet  with  an  embedded  place  to  be  returned,  the 
bounding  boxes  of  the  place  and  the  filter  applied  in  the  stream  must  intersect. 
One  drawback  of  retrieving  Tweets  with  places  is  that  the  user  location  may 
not  be  the  same  as  the  one  mentioned  in  the  Tweet.  A user  can  post  a Tweet 
containing  a mention  of  a place  while  Tweeting  from  somewhere  else.  Fur- 
thermore, many  places  refer  to  larger  areas,  cities  or  even  countries  which  may 
feed  the  stream  with  Tweets  that  are  outside  of  the  defined  bounding  box.  As 
a result,  it  is  necessary  to  additionally  filter  the  incoming  data.  MongoDB  and 
its  document  model  is  probably  the  most  suitable  database  system  for  storing 
social  media  posts.  In  addition,  it  supports  temporal  and  geo-spatial  indexes 
which  is  essential  to  the  task  at  hand  as  we  are  dealing  with  geographical  and 
temporal  data. 

In  future  work,  it  would  be  valuable  to  explore  enriching  the  data  with  Twit- 
ter messages  that  are  not  explicitly  geo -tagged.  As  for  geo -tagged  Tweets,  the 
percentage  of  geo-tagged  ones  with  exact  coordinates  goes  as  high  as  1.2% 
(Dredze  2013),  while  1.3%  contain  a Place  object.  In  order  to  increase  the 
utilization  of  the  geo -referenced  information  generated  on  Twitter,  one  must 
look  beyond  explicitly  volunteered  geo -data.  Users  often  include  references  to 
geographical  information  in  the  content  they  post  on  the  Web,  without  tying 
it  to  specific  coordinates.  Location  recognition  in  social  media  is  a challeng- 
ing problem,  even  more  so  in  Twitter  due  to  the  140  character  limitation  and 
the  abundance  of  abbreviations,  informal  language  or  terms  used  only  in  the 
Twitter  community. 

In  order  to  enrich  dataset  with  VGI,  one  must  first  define  a set  of  keywords 
to  track  using  the  Streaming  API,  as  the  number  of  keywords  that  can  be  fed 
to  the  API  is  limited  to  400.  In  order  to  get  Tweets  most  related  to  social  hot- 
spots, the  stream  should  be  fed  with  keywords  relevant  to  social  hotspots  and 
with  words  related  to  the  area  that  is  being  monitored.  Liu  (2014)  developed 
an  extensive  system  to  disambiguate  and  identify  locations  mentioned  in  text 
and  for  estimating  user  location  out  of  their  activity.  The  approach  relies  on 
sequential  learning  methods  to  automatically  learn  the  relations  between  parts 
of  locations  where  the  classifiers  are  fed  with  hand-crafted  features. 


Dataset 

The  system  that  is  showcased  in  the  remainder  of  the  paper  is  based  on  Twitter 
data  from  New  York  between  February  22  and  April  16  2015.  We  set  the  bound- 
ing box  to  the  following  longitude  and  latitude  pairs:  (-74,  40),  (-73,  41).  New 
York  City  is  chosen  because  its  one  of  the  most  active  cities  Twitter-wise.  Also, 
we  assume  that  the  majority  of  the  messages  will  be  in  English.  New  Yorks 
big  population  and  its  dense  social  places  structure  pose  both  difficulties  and 
advantages  for  mining  geo-data.  On  one  hand,  the  abundance  of  social  places 
suggests  that  a relatively  high  number  of  Tweets  will  be  related  to  social  places. 
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On  the  other  hand,  the  dense  structure  poses  challenges  to  precise  clustering 
of  Tweets. 

For  the  above  mentioned  period,  we  collected  4,125,542  Twitter  messages, 
generated  from  a total  of  226,114  distinct  users.  Out  of  these,  3,274,724  mes- 
sages contained  exact  coordinates,  while  the  others  had  a place  entity  only, 
which  were  attached  manually.  However,  we  observed  that  among  these  Tweets 
that  also  had  place  entities  attached,  a significant  portion  was  outside  of  the 
defined  bounding  box.  Upon  filtering,  the  dataset  was  reduced  to  2,350,739 
messages. 

In  the  set  collected,  we  observed  that  place  entities  generally  are  related  to 
greater  geographical  areas.  For  example,  places  such  as  ‘Manhattan,  ‘New  York’ 
or  ‘Brooklyn  appear  very  often,  as  opposed  to  points  of  interests.  Such  places 
are  insignificant  to  our  analysis  and  have  to  be  filtered  out  because  they  refer  to 
a very  broad  area.  Only  9,119  Tweets  with  embedded  POIs  that  are  within  the 
defined  bounding  box  were  retrieved,  while  269  messages  have  POIs  outside  of 
the  bounding  box. 


System  Overview 

The  presented  system  in  this  work  monitors  Twitter  streams  for  a defined 
geographic  area  and  identifies  social  hotspots  as  they  are  emerging.  Figure  1 


Twitter  Streaming  API 


4 


Data  collection 

i 

Tweets  pre-processing 

i 

Clustering 

i 

Cluster  feature 
extraction 

i 

Social  hotspot  detection 

i 

Sentiment  analysis 

Figure  1:  System  architecture. 
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depicts  an  overview  of  the  system  architecture.  The  system  can  be  broken  down 
to  the  following  key  components: 

• Data  retrieval  - connects  to  the  Twitter  Streaming  API  and  collects  mes- 
sages from  a certain  geographic  area. 

• Tweet  pre-processing  - cleans  Tweets  from  noisy  tokens  and  characters. 

• Cluster  generation  - creates  clusters  or  social  hotspots  candidates  from 
Tweets  retrieved  in  a limited  time  period. 

• Cluster  feature  extraction  - extracts  relevant  features  from  Tweets  in  a 
social  hotspot  candidate 

• Social  hotspot  detection  - analyses  clusters  and  evaluates  using  machine 
learning  whether  they  represent  a social  hotspot  or  not. 

• Sentiment  analysis  - analyses  the  sentiment  of  the  Tweets  in  the  social  hot- 
spot clusters. 


Social  Hotspot  Detection 

A social  hotspot  is  a geographic  POI  which  attracts  the  attention  of  many 
people  in  a limited  period  of  time.  In  the  context  of  Twitter,  detecting  social 
hotspots  requires  locating  places  with  highly  concentrated  activity.  In  order  to 
identify  social  hotspots  from  Twitter  streams,  clusters  of  geographically  close 
Tweets  must  be  created.  There  are  several  ways  of  generating  such  clusters,  few 
of  which  are  elaborated  in  the  remainder  of  this  work. 

Clustering  is  an  unsupervised  machine  learning  method  where  samples 
from  a given  set  of  data  are  grouped  based  on  features  describing  each  sample. 
For  purposes  of  the  system  described  here,  only  the  geographical  component 
of  the  Tweets  is  taken  into  consideration  for  the  clustering  phase.  Clustering 
algorithms  compute  distance  between  samples  from  the  dataset.  Since  we  are  deal- 
ing with  geo-data  in  relatively  confined  areas,  the  most  appropriate  metric  is 
Euclidean  distance. 

Algorithms  such  as  the  K-means  algorithm  that  require  the  number  of  clus- 
ters to  be  predefined  are  not  appropriate  for  the  specific  problem,  because  the 
number  of  clusters  or  social  hotspots  candidates  cannot  be  anticipated.  One 
way  of  overcoming  this  issue  is  by  using  hierarchical  agglomerative  clustering. 
Each  observation  or  Tweet  starts -off  as  single  cluster.  Iteratively,  pairs  of  clusters 
are  merged  together  as  the  algorithm  moves  up  the  hierarchy.  The  result  is  a 
dendogram  from  which  the  threshold  value  can  be  observed  which  is  used  to 
cut  the  dendogram  and  to  prevent  it  from  building  into  the  complete  hierar- 
chy. In  order  to  get  sufficiently  localized  clusters  the  threshold  has  to  be  set  to 
a very  small  value.  However,  the  complexity  of  hierarchical  clustering  ranges 
from  ©(N2)  to  ©(N3),  depending  on  the  selected  linkage  criteria.  Another  defi- 
ciency is  that  it  requires  the  pairwise  distances  of  all  the  observations  in  the  set. 
Computing  pairwise  distances  has  a ©(N2)  memory  complexity,  which  can  be 
infeasible  if  the  number  of  input  data  is  huge. 
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Another  appropriate  clustering  technique  is  DBSCAN.  It  is  a density  based 
clustering  algorithm,  not  limited  to  shapes  of  clusters  and  only  relies  on  a 
neighborhood  count  parameter  (minPts)  and  a neighborhood  distance  e.  The 
general  algorithm  classifies  each  sample  as  core  points,  reachable  points  and 
outliers  as  follows: 

• Core  points  have  at  least  minimum  number  (minPts)  of  other  points  within 
e distance 

• A point  p is  density  reachable  if  there  is  a point  q and  a path  pl  ...  pn  where 
pY=p  and  pn  = q andp.+1  is  directly  reachable  fromp.  and  is  a core  point 

• A point  is  an  outlier  if  it  is  not  reachable  from  any  other  point 

Points  with  a density  above  the  specified  threshold  are  constructed  as  clus- 
ters. DBSCAN  handles  outliers  well,  and  has  been  proven  as  very  effective  in 
processing  very  large  databases.  It  is  by  far  most  suitable  for  the  task  at  hand 
as  it  gives  close  control  over  what  is  considered  a social  hotspot  candidate. 
ST-DBSCAN  is  also  a density-based  algorithm  for  clustering  spatial-temporal 
data.  Birant  et  al.  (2007)  first  proposed  this  approach  as  an  extension  on  the 
existing  DBSCAN  in  relation  to  the  identification  of  core  and  noise  objects  and 
adjacent  clusters.  ST-DBSCAN  takes  into  consideration  the  non-spatial,  spatial 
and  temporal  attributes  of  the  data.  This  is  especially  important  in  this  case 
as  social  hotspots  are  not  fixed  in  time.  The  algorithm  requires  two  additional 
parameters,  the  distance  parameter  for  non-spatial  attributes  and  A£  which  is 
used  to  prevent  the  discovering  of  combined  clusters.  Additional  modification 
is  that  a region  is  dense  if  the  minPts  criteria  is  satisfied  by  both  of  the  distance 
parameters.  ST-DBSCAN  is  also  efficient  at  handling  noise  points  when  there 
are  clusters  with  different  densities.  In  Figure  2,  the  figure  depicting  clusters 
generated  using  hierarchical  clustering,  only  clusters  containing  at  least  minPts 
messages  are  presented.  Different  marker  colors  are  used  for  better  clarity.  We 
observe  that  DBSCAN  generates  less  clusters  than  using  hierarchical  clustering. 
The  generated  clusters  need  to  be  further  analyzed  in  order  to  determine  if  the 
Twitter  messages  refer  to  an  actual  social  hotspot  or  are  just  random  non-related 
posts  or  conversations.  For  this,  we  borrow  on  the  work  of  Walther  et  al.  (2013). 

They  developed  several  features  divided  into  two  categories.  We  only  present 
the  ten  most  effective  features. 

• Unique  posters  - the  total  number  of  unique  users. 

• Common  theme  - calculates  word  overlap  between  different  Tweets  in  the 
cluster. 

• @ Ratio  - the  number  of  user  mentions  relative  to  the  number  of  Tweets 

• Unique  coordinates  - the  total  number  of  unique  coordinates  within  a 
cluster 

• Ratio  of  Foursquare  posts  - fraction  of  Tweets  originating  from  Foursquare 

• Tweets  count  - total  number  of  Twitter  messages 
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Figure  2:  Clusters  generated  with  hierarchical  and  DBS  CAN  clustering.  Each 
marker  represents  a separate  social  hotspot  candidate. 


• Semantic  Category  - whether  the  cluster  belongs  to  one  of  several  prede- 
fined categories 

• Subjectivity  - indicates  whether  users  share  subjective  posts  or  just  various 
information 

• Positive  sentiment  - indicates  positive  sentiment 

• Ratio  of  unique  posters  - the  number  of  unique  users  in  relation  to  the 
number  of  Tweets 
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The  effectiveness  of  the  described  features  are  experimentally  evaluated  in 
the  work  of  Walther  el  al.  (2013).  They  manually  annotated  1000  clusters  with 
binary  labels  in  respect  to  whether  they  represent  a real-world  event  or  not. 
Textual  features  proved  more  significant  than  the  Other  group,  but  a combina- 
tion of  both  feature  categories  provides  for  best  performance.  Walther  et  al. 
(2013)  used  three  different  machine  learning  algorithms,  specifically  Naive 
Bayes,  Multilayer  Perceptron  and  C4.5  decision  trees.  However,  it  would  be 
beneficial  to  evaluate  the  effectiveness  of  Support  Vector  Machines  and  other 
machine  learning  approaches. 


Sentiment  Analysis 

Sentiment  analysis  is  the  task  of  identifying  human  emotion  in  text.  Social 
networks  and  media  sparked  interest  in  sentiment  analysis  amongst  both  aca- 
demia and  industry  as  users  often  share  opinions  and  feeling  on  social  net- 
works (Pak  2010).  Observing  sentiment  regarding  social  hotspots  can  provide 
valuable  information  about  the  popularity  of  a place  or  an  event  that  is  taken 
into  consideration. 

So  far,  Twitter  sentiment  analysis  has  heavily  relied  on  hand-crafted  features, 
which  are  both  incomplete  and  too  domain  specific  and  depend  on  lexicons 
with  sentiment  polarity.  Additionally,  the  process  of  manual  feature  gen- 
eration is  time-consuming  and  requires  extensive  domain  knowledge.  Deep 
learning  techniques  on  the  other  hand,  automate  the  feature  generation  and 
are  more  robust  and  flexible  when  applied  to  various  domains.  Convolutional 
Neural  Networks  (CNN)  have  been  shown  to  achieve  state-of-the-art  results 
in  sentence  classification  and  specifically  in  sentiment  analysis  (Kim  2014), 
(dos  Santos  2014). 

We  have  developed  an  architecture  for  sentiment  analysis  that  uses  a CNN 
with  multiple  filters  with  varying  window  sizes.  The  model  is  built  on  the  work 
of  (Kim  2014)  where  they  report  state-of-the-art  performances  on  4 out  of  7 
sentence  classification  tasks.  It  consists  of  one  convolutional  layer  and  a max- 
over-time  pooling  layer  which  outputs  a fixed  sized  vector.  This  vector  is  then 
fed  to  a three  layer  feed-forward  network  with  two  non-linear  layers  and  a soft- 
max  output  layer  which  gives  the  probability  distribution  over  the  sentiment 
classes.  The  architecture  maps  each  token  in  a given  Tweet  to  an  appropriate 
word  representation.  The  approach  leverages  large  Twitter  corpora  for  unsu- 
pervised learning  of  these  word  representations,  which  capture  syntactic  and 
semantic  characteristics  of  words.  Instead  of  doing  the  pre -training  of  word 
embeddings  ourselves,  we  use  available  word  vectors.2  We  continuously  update 
word  vectors  by  back-propagation  during  training  time  and  by  doing  so  we 
capture  and  encode  sentiment  information  into  the  word  embeddings.  We  train 


2 http:// nlp.stanford.edu/projects/ glove/ 
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Figure  3:  Sentiment  heatmap. 
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the  deep  convolutional  neural  network  with  manually  annotated  Tweets  pro- 
vided by  the  Sentiment  Analysis  in  Twitter  task  on  SemEval-2015.  The  neural 
network  in  our  system  classifies  Tweet  sentiment  into  3 classes,  where  the  labels 
can  be  positive,  negative  or  neutral.  For  future  work,  it  would  be  interesting  to 
use  the  proposed  architecture  for  emotion  identification  which  may  provide  an 
even  deeper  insight  into  the  social  hotspot  popularity 
The  system  presents  the  overall  sentiment  of  a social  hotspot  which  can  even 
be  used  for  recommending  points  of  interests  to  users  as  it  can  provide  them 
with  feedback  for  the  popularity  of  a social  hotspot  in  real-time.  In  Figure  3 a 
sentiment  heatmap  is  depicted,  where  the  red  color  represents  negative,  blue 
represent  positive  and  green  neutral  Tweets. 


Conclusion 

In  this  paper,  we  have  presented  a system  for  detecting  and  identifying  social 
hotspots  from  Twitter  stream  data,  and  applying  sentiment  analysis  on  the  data. 
Utilizing  the  Twitter  Streaming  API,  we  have  collected  a significant  number  of 
Tweets  from  New  York  City  and  we  have  evaluated  the  quality  of  the  retrieved 
data.  Hierarchical  and  DBSCAN  clustering  algorithms  have  been  analyzed  for 
their  usefulness  in  generating  spatial  clusters.  We  also  elaborated  on  techniques 
for  identifying  social  hotspots  out  of  spatial  clusters.  Finally,  we  present  an 
approach  for  sentiment  analysis  based  on  deep  learning  that  is  used  to  deter- 
mine the  attitude  of  users  that  participate  in  the  hotspots. 
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Introduction 

During  the  last  two  decades,  the  role  of  internet  users  changed  dramatically. 
While  they  were  mostly  passive  content  consumers  before,  they  are  now  con- 
sidered proactive  data  producers.  This  phenomenon  is  summarized  by  the  term 
“Prosumer”  (Ritzer  & Jurgenson  (2010)  and  gets  facilitated  through  major 
technological  advancements  such  as  ubiquitous  access  to  the  mobile  Internet 
and  a widespread  use  of  smartphones  equipped  with  positioning  and  sensing 
capabilities.  These  outlined  developments  do  not  just  happen  recently,  but  trace 
back  to  the  much  older  development  around  the  so  called  “Web  2.0”  (ITU  2014). 
In  geospatial  terms,  these  developments  are  well  reflected  by  Mike  Goodchild  s 
popular  definition  of  ‘Citizens  as  Sensors  (Goodchild  2007),  where  ordinary 
people  capture  and  disseminate  “Volunteered  Geographic  Information  (VGI).” 
Haklay  further  puts  this  development  into  broader  context  and  rather  coined 
the  term  “Geo Web”  (Haklay  et  al  2008).  OpenStreetMap  (OSM)  is  probably 
the  most  prominent  example  of  VGI. 

Projects  like  OSM  provide  a well-defined  data  capturing  protocol  as  well  as  a 
clear  mission  regarding  their  contributed  contents.  In  contrast,  data  originating 
from  online  social  networks  (another  source  of  VGI)  is  way  more  heterogene- 
ous and  diverse.  At  the  same  time,  however,  it  may  also  provide  high  levels  of 
semantic  detail  and  is  generated  by  a larger  number  of  users.  Consequently,  it 
gained  the  interest  of  various  research  disciplines.  These  range  from  sociology 
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toward  linguistics,  and  of  course  geography  and  GIScience.  The  latter  one  is 
facilitated  by  the  fact  that  a great  deal  of  information  contributed  to  social 
media  is  geotagged.  Thus,  the  remainder  of  this  chapter  focuses  on  the  spatial 
aspects  of  social  media,  and  potential  applications  that  can  be  derived  from  this 
kind  of  data. 

Section  A highlights  the  general  potential  of  social  media  analysis  for  inves- 
tigating social  phenomena.  We  do  that  by  outlining  selected  case  studies  from 
the  exemplary  field  of  human  mobility  analysis.  These  have  demonstrated  the 
usefulness  of  social  media  for  investigating  mobility  patterns  as  well  as  human 
spatial  behavior.  Section  B then  provides  an  overview  of  several  different  appli- 
cation domains  of  social  media  analyses,  with  a particular  focus  on  Twitter. 
Finally,  Section  C discusses  some  technical  issues  of  established  spatial  analysis 
methods  and  their  application  to  social  media  data.  We  conclude  the  chapter  by 
summarizing  its  different  parts.  We  further  provide  recommendations  regarding 
future  areas  for  GIScience  research  on  social  media  data. 


Utilization  of  social  media  data  for  investigating  urban 
environments 

The  spatial  and  social  structures  of  a city  as  well  as  the  dynamic  nature  of  human 
activities  result  in  certain  collective  and  individual  human  behavior  patterns. 
Social  media  data  can  help  to  “sense”  this  type  of  information  from  urban  envi- 
ronments in  an  in-situ  manner.  GIScience  research  thereby  is  focused  on  the 
overall  question  how  corresponding  spatiotemporal  patterns  from  ubiquitous 
sensor  networks  and  heterogeneous  data  streams  can  be  explored,  extracted, 
validated  and  aggregated.  In  turn,  such  information  might  enable  us  to  sense 
everyday  spatial  processes  and  to  gain  knowledge  about  urban  environments, 
especially  with  respect  to  collective  human  dynamics.  The  study  of  these  issues 
has  become  one  of  the  primary  objectives  of  GIScience  (Giannotti  & Pedreschi 
2008). 

The  information  originating  from  social  media  messages  (e.g.,  Tweets  in  case 
of  Twitter)  may  contain  spatial,  temporal  and  semantic  attributes.  Considering 
these  dimensions,  social  media  can  be  considered  as  a (partial)  proxy  of  real 
world  happenings.  However,  space,  time  as  well  as  semantics  are  influenced  by 
each  user’s  individual  perception  of  the  surrounding  space.  It  is  thus  important 
to  figure  out  ways  to  circumvent  these  issues  for  gaining  trustworthy  and  objec- 
tive information  from  these  data  sources. 

The  following  short  paragraphs  outline  case  studies  in  which  a range  of  GIS- 
cience researchers  has  drawn  human  mobility  and  urban  study  related  knowl- 
edge from  Twitter.  We  group  these  studies  in  accordance  to  their  underlying 
research  goals.  The  listed  paragraphs  thus  provide  the  reader  a quick  overview 
of  both  the  types  of  studies  that  have  been  conducted  as  well  as  methods  and 
outcomes. 
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Mobility  and  social  behavior.  Studying  the  social  dynamics  of  a city  remains 
a challenging  endeavor,  which  has  recently  been  carried  in  a qualitative  man- 
ner. Thus,  social  media  might  be  a promising  source  of  information  in  order 
to  provide  a better  understanding  of  social  dynamics  within  urban  environ- 
ments and  resulted  in  various  research  efforts.  Regarding  the  analysis  of  collec- 
tive human  mobility  and  activity  patterns  from  social  media,  Cho  et  al.  (2011) 
investigate  social  ties  and  their  influence  on  human  mobility  patterns  by  com- 
paring social  media  check-in  data  and  cellphone  location  data.  They  found  a 
stronger  association  of  social  network  ties  influencing  long-distance  travel  than 
short  range  spatially  and  temporally  periodic  movements.  Within  the  observed 
Twitter  user  pattern,  Lee  & Sumiya  (2010)  study  user  behavior  by  measur- 
ing geographic  regularities  and  detecting  geo-social  events  through  identify- 
ing Regions  of  Interests  (Rol).  Another  approach  conducted  by  Noulas  et  al. 
(2011),  Cranshaw  et  al.  (2012)  and  Kafsi  and  Cramer  (2015)  is  the  identifica- 
tion of  characteristic  neighborhoods,  collective  movement  patterns  and  social 
ties  within  certain  user  communities  from  Foursquare  and  other  Social  media 
data.  In  a similar  approach  for  Twitter,  Li  et  al  (2014)  measure  the  spatial  dis- 
persion of  users  in  a community  and  their  trajectories.  Hawelka  et  al.  (2012) 
aim  to  further  empirically  validated  the  observed  human  behavior  patterns  and 
found  a correlation  between  the  conducted  Twitter  census  and  economic  key 
figures.  Furthermore,  Li  et  al.  (2013)  explore  spatiotemporal  patterns  of  Twitter 
and  Flickr  data  and  investigated  a relationship  between  socioeconomic  charac- 
teristics of  people  who  are  generating  social  media  posts  in  the  US. 

Mobility  and  underlying  urban  structures.  The  exploration  of  the  relation- 
ships and  the  impact  of  urban  structures  on  human  mobility  is  an  interesting 
study  area  for  social  media  researcher.  Wakamiya  et  al.  (2011)  investigate  tem- 
poral patterns  of  crowd  behavior  over  Japan  by  spatial  partitioning  Tweets  in 
order  to  extract  urban  characteristics.  On  a smaller  scale  several  studies  investi- 
gate the  connection  with  extracted  urban  activities  from  social  media  and  their 
connection  with  the  underlying  urban  structure.  Kling  et  al.  (2012)  were  able 
to  detect  spatiotemporal  clusters  of  frequently  occurring  urban  topics  in  New 
York.  Furthermore,  Ferrari  et  al.  (2012)  also  work  with  georeferenced  Tweets 
and  a semantic  probabilistic  topic  modeling  approach  to  automatically  extract 
urban  patterns  from  location-based  social  networks.  The  study  concluded  that 
extracted  urban  motion  patterns  and  identified  hotspots  in  the  city  allow  the 
inference  of  crowd  behaviors  that  recur  over  time  and  space.  A similar  approach 
by  using  Foursquare  data  by  Cheng  et  al.  (2011)  and  Hasan  et  al.  (2013)  also 
resulted  in  the  characterization  of  urban  human  mobility  and  activity  patterns. 
Andrienko  & Andrienko  (2013)  correlated  the  spatiotemporal  clusters  of  key- 
word based  filtered  georeferenced  Tweets  of  places  where  people  Tweet  with  US 
population  densities.  The  results  have  shown  strong  correlations  between  the 
observed  Twitter  distribution  and  census  data,  suggesting  that  social  media  is  a 
reliable  proxy  for  the  inference  of  mobility  patterns.  One  further  application  is 
to  derive  intra-urban  events  showing  distinct  mobility  patterns  over  time.  This 
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spatiotemporal  movement  has  proven  to  reflect  typical  mobility  behavior  in  the 
underlying  urban  structures  (Steiger,  Westerholt,  et  al.  2015). 

Mobility  and  human  activities.  Several  studies  infer  individual  and  collec- 
tive human  daily  activity  patterns  by  analyzing  crowdsourced  information, 
such  as  taxi  trip  records  (Liang  et  al.  2012),  GPS  traces  (Azevedo  8c  Bezerra 
2009)  (Jiang,  Yin,  & Zhao  2009)  or  mobile  phone  records  (Candia  & Gonzalez 
2008)  (Gao  2014).  Consequently,  a large  literature  body  also  focus  on  study- 
ing human  mobility  and  activity  pattern  from  social  media  data.  Krumm  et 
al.  (2011)  estimate  individual  home  locations  of  heavy  Twitter  user  and  apply 
machine  learning  algorithms  to  classify  and  predict  individual  travel  behavior. 
Jin  et  al  (2014)  developed  a method  to  infer  users’  mobility  patterns  from 
check-ins  in  Foursquare.  Coffey  8c  Pozdnoukhov  (2013)  go  one  step  further 
and  semantically  annotate  mobility  flow  datasets  with  activity  information 
and  trip  purposes  from  Tweets.  Similarly,  Wu  et  al  (2015)  utilize  social  media 
to  annotate  the  location  history  of  mobile  phone  users  for  the  characteriza- 
tion of  certain  social  activities.  Focusing  on  the  content  of  Tweets,  Grinberg 
et  al  (2013)  proposed  a method  to  detect  semantic  patterns  to  infer  clusters 
of  users’  real  world  activity.  Gao  (2014)  developed  a probabilistic  approach 
to  make  place  recommendations  based  on  the  users’  geo-social  circles,  as 
extracted  from  Foursquare.  In  another  study,  the  authors  estimate  spatiotem- 
poral mobility  flows  from  Twitter  for  the  area  of  greater  Los  Angeles  to  infer 
origin-  and  destination  trips  (Gao  et  al.  2014).  Results  have  shown  similar  pat- 
tern when  comparing  with  community  survey  data.  In  a previous  study  we 
introduced  a semantic  and  spatial  analysis  method  (Steiger,  Lauer,  et  al  2014), 
through  which  we  were  able  to  extract  geographic  features  from  uncertain 
Twitter  data  and  have  shown  that  observed  clusters  correspond  to  landmarks, 
such  as  highly  frequented  squares  and  major  transportation  hubs.  A further 
investigation  revealed  similar  semantic  layers  that  represent  collective  human 
mobility  flows  in  co-occurrence  with  underlying  social  activity  (Steiger,  Eller- 
siek,  et  al  2014)  and  could  thus  lead  to  new  insights  in  characterizing  urban 
mobility. 


Future  research  recommendations 

Further  research  needs  to  be  conducted  to  assess  the  reliability  of  social  media 
datasets.  It  also  must  be  noted  that  the  data  collected  from  wireless  devices  are 
influenced  by  GPS/ WIFI  inaccuracy  issues  (Zandbergen  and  Barbeau  2011). 
Moreover,  users  can  individually  choose  to  share  their  precise  location  to  a 
Tweet  or  just  a general  location  information  (such  as  a city  or  neighborhood). 
This  resulting  location  uncertainty  leads  to  imprecise  location  information  of 
geotagged  Tweets  (Li  et  al  2011). 

Within  the  semantic  attribute  one  must  consider  that  the  containing  infor- 
mation may  relate  to  events  in  the  past,  present  or  even  future  (Sengstock  and 
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Gertz  2012).  Principally  the  text  corpora  as  such  in  social  media  posts  are  rela- 
tively sparse  and  vague.  It  may  also  be  fairly  ambiguous  and  hence  observed 
phenomena  may  only  be  a weak  indicator  of  a real  world  event.  This  uncertain 
semantic  knowledge  is  a result  of  the  fact  that  people  using  Twitter  have  indi- 
vidual motivations  to  post  information  and  their  main  intention  is  to  primarily 
serve  their  own  communication  needs.  One  further  typical  characteristic  of 
social  media  is  that  users  do  not  post  equally  distributed  in  geographic  space 
and  time  leading  to  a heterogeneous  dispersion  of  posts.  Jatowt  et  al  (2015) 
further  assess  these  varying  temporal  patterns  and  dynamics  within  social 
media.  Furthermore,  georeferenced  social  media  posts  only  represent  a small 
fraction  of  the  overall  available  data.  Not  all  user  groups  use  all  types  of  social 
media  platforms  similarly,  which  produces  a potentially  strong  socio-demo- 
graphic  bias  (Longley  and  Adnan  2015).  Last,  the  application  of  spatial  and 
semantic  methods  themselves  creates  uncertainties,  since  the  distribution  of 
specific  geographic  phenomena  and  their  semantic  complexities  within  Tweets 
are  not  known  beforehand  (Westerholt  et  al  2015).  Hence,  it  is  important  to 
compare  and  validate  results  with  other  acquired  sensor  data. 

Conducting  further  research  in  this  area  however  will  be  worthwhile,  since 
study  results  may  provide  new  additional  insights  into  the  complex  human- 
sensor-city relationship  at  a much  more  fine-grained  spatial  and  temporal  level 
than  before.  New  knowledge  gained  from  this  research  will  provide  a better 
understanding  of  individual  and  collective  human  behavior  within  urban  envi- 
ronments and  may  assist  stakeholders  and  decision  makers  in  their  planning 
processes. 


Application  Domains  of  Social  Media  Analyses 

Location-based  social  networks  (LBSN)  (Roick  and  Heuser  2013)  offer  a vast 
amount  of  voluntary  content.  The  investigation  of  human  activities  in  location- 
based  social  networks  is  one  promising  example  of  exploring  spatial  structures 
in  order  to  infer  underlying  spatio  temp  oral  patterns.  Twitter  for  example  is 
more  and  more  recognized  by  numerous  research  domains.  In  particular  it 
provides  an  opportunity  for  GIScience  to  understand  geographic  processes  and 
spatial  relationships  comprised  in  social  networks.  Summarizing  the  current 
state  of  research  concerning  the  application  for  spatiotemporal  analyses,  one 
outcome  of  a previously  conducted  systematic  literature  (Steiger,  Albuquerque, 
et  al  2015)  revealed  that  Twitter  analyses  are  mainly  focused  on  the  spatiotem- 
poral classification  and  detection  of  events.  Principal  investigated  application 
domains  are: 

Event  Detection.  To  detect  events,  researchers  are  currently  looking  for  spa- 
tial, temporal  and  semantic  patterns  within  Twitter.  In  this  respect  peo- 
ple act  as  a social  sensors  for  events  (Yardi  and  Boyd  2010,  Chae  et  al 
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2012).  Disaster-  and  emergency  management  as  one  event  detection  sub- 
field has  been  the  primarily  identified  application  in  nearly  a third  of  all 
reviewed  studies  (Sakaki  et  al  2010,  Murthy  and  Longwell  2013,  Crooks 
et  al  2013).  Further  research  has  been  conducted  on  utilizing  Twitter  in 
traffic  management.  This  can  be  found  in  14%  of  reviewed  studies  (Kosala 
and  Adi  2012,  Wakamiya  and  Lee  2012,  Lenormand  et  al  2014).  Another 
area  which  seems  to  be  quite  popular  is  research  on  Twitter  data  for  dis- 
ease/ health  management  adding  up  to  another  5%  of  the  reviewed  studies 
(Lampos  and  Cristianini  2010,  Veloso  and  Ferraz  2011,  Sofean  and  Smith 
2012).  A famous  example  is  the  derivation  and  prediction  of  information 
on  infection  sources  and  the  spreading  of  an  illness  from  Twitter  messages 
(Culotta  2010,  Collier  et  al  2011).  One  prominent  example  is  earthquake 
detection  from  Twitter  data  (Longueville  et  al  2010,  Zook  et  al  2010). 
This  has  been  successfully  accomplished  in  a number  of  studies  correlat- 
ing results  with  official  earthquake  sensor  data  (Tapia  et  al  2011,  Thom- 
son et  al  2012).  Sakaki  et  al  (2010)  have  developed  an  algorithm  that  uses 
Twitter  to  calculate  earthquakes  epicenters  and  the  typhoons  trajectories. 
Moreover,  situational  information  can  be  derived  from  location-related 
short  messages  to  coordinate  emergency  responses  (Vieweg  et  al  2010). 
Also  in  the  context  of  disease  and  health  management  similar  outcomes 
have  been  derived.  Tweets  showing  disease  incidents  have  shown  simi- 
lar spatiotemporal  distributions  as  those  in  with  official  reports.  With 
these  studies  research  has  proven  the  trustworthiness  and  a high  level  of 
representativeness  of  Tweets  throughout  different  application  domains 
(Albuquerque  et  al  2015). 

Location  Inference.  Locations  of  users  within  social  networks  can  be  inferred 
or  even  predicted  with  the  help  of  direct  or  indirect  geolocation  informa- 
tion derived  from  the  provided  metadata  or  from  the  semantic  content 
(Kinsella  et  al  2011,  Hong  et  al  2012,  Hiruta  et  al  2012).  The  geographic 
accuracy  could  be  increased  by  extracting  the  textual  information  from 
the  Tweet  or  from  the  metadata  itself.  For  example,  Lamprianidis  and 
Pfoser  (2011)  have  extracted  locations  and  their  names  from  Flickr  pic- 
tures by  clustering  user-generated  data  points  associated  with  geo-refer- 
enced pictures.  Kelm  et  al  (2013)  discusses  various  methods  to  extract 
place  names  from  textual  data  from  articles,  posts  or  tags  in  geo-social 
networks,  including  place  name  gazetteer  and  statistical  language  mod- 
eling. Some  methods  follow  an  opposite  approach  and  infer  the  location 
of  a feature  from  implicit  location  information.  Serdyukov  et  al  (2009) 
model  the  probability  that  a group  of  tags  be  assigned  to  a location.  Sim- 
ilarly, (Gallagher  et  al  2009)  used  location  probability  maps  generated 
from  tags  for  the  same  purpose.  Van  Laere  et  al  (2010)  have  pursued  the 
same  goal  using  k-medoids  and  Naive  Bayes  clustering  methods.  Some 
approaches  focus  on  inferring  a users  or  a group  of  users’  location.  Cheng 
et  al  (2010)  have  proposed  a probabilistic  method  to  determine  users' 
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location  from  the  content  of  their  Twitter  messages.  Other  authors  have 
proposed  to  use  the  location  of  users'  friends  to  achieve  the  same  goal 
(Backstrom  et  al  2010).  Stefanidis  et  al  (2013)  have  proposed  a frame- 
work to  harvest  ambient  geospatial  information  from  social  media  feeds 
to  locate  social  hotspots  or  to  map  social  networks  in  a given  geographical 
area.  Ajao  et  al  (2015)  summarize  the  broad  range  of  available  techniques 
applied  to  infer  direct  and  indirect  location  from  Twitter  messages  and 
social  media  users. 

Geo-Social  Network  Analysis.  Another  important  domain  of  research  is 
social  analysis  which  investigates  relationships  of  individual  users  within  a 
social  network  (Wu  et  al  2011,  Cranshaw  et  al  2012).  Geo-social  network 
analysis  seeks  to  identify  the  structure  of  social  networks  and  their  distri- 
bution in  geographic  space  (Scellato  and  Mascolo  2010,  Lee  and  Sumiya 
2010).  Social  ties  may  feature  distinct  spatial  distributions  enabling  spati- 
otemporal  analyses.  These  distributions  can  help  finding  collective  social 
activities  and  ultimately  understanding  geographical  processes.  A subfield 
of  geo-social  network  analysis  are  sentiment  and  emotion  analysis  (Wang 
et  al  2012,  Quercia  et  al  2012).  This  field  of  research  also  offers  a great 
potential  for  GIScience  in  the  context  of  extracting  contextual  emotional 
information  within  urban  and  rural  environments.  One  promising  fur- 
ther field  of  research  within  social  analysis  which  should  be  mentioned  is 
urban  planning  and  management  which  also  could  benefit  from  the  rich 
data  found  in  location  based  social  networks  such  as  Twitter.  In  the  con- 
text of  disaster  management,  several  studies  aim  to  infer  the  social  dimen- 
sions within  certain  geo -located  communities  in  twitter  during  disaster 
events  (Conover  et  al  2013,  Bakillah  et  al  2014). 


Future  research  recommendations 

Social  Media  data  for  research  has  proven  to  be  a valuable  source,  as  it  not  only 
comes  for  free,  but  also  features  a high  spatiotemporal  resolution.  This  kind  of 
data  especially  enables  possibilities  to  find  spatial  patterns  and  events  which 
can  help  validating  existing  information  sources.  One  identified  main  research 
gap  is  the  exploration  of  human  spatial  behavior  (Miller  & Goodchild  2014) 
in  order  to  gain  knowledge  about  the  underlying  geographic  processes  and 
dynamics.  Furthermore,  the  current  research  foci  allow  to  transfer  established 
methods  from  various  disciplines  (e.g.  Computer-  and  Information  Science, 
Social  Science  etc.)  into  other  disciplines  and  enhancing  new  applications.  As 
one  example,  more  use  of  computer  linguistic  approaches  to  leverage  knowl- 
edge from  textual  information,  combined  with  methods  for  spatiotemporal 
analysis  from  computational  sciences  could  lead  to  new  insights  within  spe- 
cific geographic  application  domains,  such  as  disaster  management  or  human 
mobility  analysis. 
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Spatial  Analysis  of  Social  Media  Feeds  - Challenges 
and  Approaches 

The  primary  goal  of  spatial  analysis  is  to  explore  structures  within  spatial  data. 
This  typically  involves  tasks  like  finding  clusters  on  a map  or  figuring  out  distri- 
butional characteristics  of  data.  One  theoretical  field  underlying  spatial  analysis 
is  spatial  statistics.  This  field  provides  the  basic  principles  that  are  underlying 
many  spatial  analysis  problems.  Key  to  this  field  is  identifying  spatial  correla- 
tions, and  thus  hints  on  systematic  patterns  in  geographic  data  (Fischer  & Getis 
2010).  Respective  methods  and  techniques  are  thus  useful  tools  for  gaining 
geographic  insight  into  social  media  data. 

The  spatial  analysis  of  social  media  data  is  typically  conducted  in  an  explor- 
atory manner.  This  is  due  to  lacking  knowledge  about  potential  underlying 
spatial  processes,  and  thus  about  social  media  messages  and  their  dispersal  in 
geographic  space  in  general.  Useful  tools  on  that  regard  are  the  K-Function 
(Ripley  1976)  (purely  geometric)  and  the  mark  correlation  function  (Stoyan  & 
Stoyan  1994)  (attribute  values),  both  originating  from  spatial  point  pattern 
analysis.  These  methods  allow  identifying  significant  geometric  clustering 
and  regularity  within  stochastic  point  patterns.  When  the  geometry  is  fixed 
(or  rather  treated  as  such)  spatial  autocorrelation  statistics  like  Morans  I 
(Moran  1950,  Cliff  & Ord  1973)  and  hot  spot  statistics  like  Getis-Ord’s  G sta- 
tistics (Getis  & Ord  1992,  Ord  & Getis  1995)  are  suitable  alternatives.  These 
assess  the  degree  of  randomness  within  georeferenced  attributes  associated  to 
units  on  a fixed  geographic  layout.  In  fact,  many  of  the  latter  are  essentially 
identical  to  different  variants  of  the  mark  correlation  function  (see,  e.g.,  Shi- 
matani  2002).  Thereby,  Morans  I tests  for  correlations  between  neighbored 
observations  across  space,  while  G separates  between  extremal  values  (i.e., 
high  and  low). 

As  mentioned  earlier,  thorough  spatial  knowledge  about  social  media  data- 
sets is  typically  lacking.  Consequently,  analysts  oftentimes  proceed  with  a trial  - 
and-error  approach  when  parameterizing  the  methods  mentioned  above.  It  is 
common  practice  to  apply  these  techniques  to  different  scales.  The  goal  then  is 
to  sort  out  that  scale  at  which  patterning  seems  to  be  most  pronounced.  How- 
ever, the  techniques  mentioned  so  far  were  designed  long  before  the  appear- 
ance of  social  media  and  similar  kinds  of  user-generated  data.  The  idea  of  the 
following  two  sections  is  thus  to  briefly  reflect  differences  between  social  media 
and  more  traditional  data,  and  to  give  some  recommendations  with  respect  to 
the  spatial  analysis  of  these. 


Potential  issues  and  pitfalls 

The  issues  presented  in  the  following  are  likely  to  occur  when  analyzing  social 
media  feeds  with  established  methods  from  spatial  analysis.  It  is  important  to 
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note  that  social  media  feeds  provide  a mixture  of  indications  from  different 
real-world  (and  also  some  solely  virtual)  phenomena.  This  is  due  to  the  autono- 
mous manner  in  which  the  data  is  being  collected.  Users  can  contribute  any 
type  of  content  from  any  place  at  any  time.  Such  a mixture  might  be  benefi- 
cial in  terms  of  the  wealth  of  contained  information  about  the  users’  everyday 
lives.  However,  it  also  imputes  some  critical  problems  when  it  comes  to  spatial 
analysis.  Probably  the  most  trivial  yet  critical  among  these  is  the  mere  mixture 
of  information  as  such.  Any  attribute  which  is  derived  from  social  media  is 
highly  likely  to  include  information  from  several  different  real-world  phenom- 
ena. Analyzing  social  media  therefore  comes  at  the  risk  of  drawing  conclusions 
about  a mixture  population  that  might  not  exist  in  reality.  In  most  circum- 
stances this  is  not  desirable,  since  it  does  not  lead  to  reasonable  insight  about 
any  real-world  process.  One  way  to  overcome  this  problem  would  be  an  accu- 
rate a priori  semantic  separation.  However,  that  is  a non-trivial  task  on  its  own 
right  given  the  colloquial  language  used  in  corresponding  messages. 

Another  issue  with  social  media  data  is  the  implicit  subjectivity  that  is  per 
se  introduced  by  the  notion  of  “humans  as  sensors”  (Goodchild  2007).  One 
implication  from  that  concept  is  the  diversity  at  which  people  perceive  environ- 
ments (see  also  Section  A).  Similar  phenomena  might  lead  to  varying  responses 
among  different  users.  This  inevitably  leads  to  an  increased  difficulty  in  analyz- 
ing the  semantics  (i.e.  the  attribute  value)  of  the  observations;  and  thus  to  a 
potential  misclassification  of  phenomena.  The  implication  of  that  for  spatial 
analysis  is  crucial:  techniques  such  as  measures  of  spatial  autocorrelation  or 
spatial  regression  techniques  are  based  on  both,  spatial  characteristics  as  well 
as  the  attribute  values.  Consequently,  spatial  analysis  techniques  might  end  up 
in  spurious  results  when  the  analyst  fails  controlling  such  effects. 

The  analysis  of  social  media  can  also  lead  to  an  artificial  increase  in  the  number 
of  type  I / type  II  errors.  This  problem  is  likely  to  occur  whenever  testing  hypoth- 
eses about  spatial  patterns  with  social  media  datasets.  One  might  be  interested 
in  assessing  spatial  heterogeneity  by  means  of  local  statistics  like  local  Morans 
I (Anselin  1995)  or  G.*(Ord  and  Getis  1995).  It  is  common  sense  that  these 
methods  lead  to  an  increase  in  type  I errors  due  to  alpha  error  inflation  (Nelson 
2012).  Thus,  it  is  important  to  control  the  alpha  level  accordingly  (e.g.,  through 
techniques  such  as  False-Discovery-Rate  (Benjamini  & Hochberg  1995)).  With 
social  media  datasets,  however,  phenomena  operating  at  smaller  scales  than  the 
adjusted  analysis  scale  might  be  considered  by  accident;  and  inadvertently  influ- 
ence the  analysis.  This  is  due  to  the  mixture  described  above  which  is  leading 
to  spatially  overlapping  representations  of  different  phenomena.  The  result  is  an 
increased  amount  of  spurious  indications  of  significant  spatial  effects. 

Another  critical  implication  of  the  scale-mixture  outlined  above  is  a potential 
creation  of  wrong  and  misleading  relationships  across  scale  levels.  Recall  that 
observations  from  smaller  scale  levels  are  prone  to  inherently  being  included 
in  analyses  at  larger  scales  due  to  potential  geometric  mixture.  Effects  from 
smaller  scales  are  therefore  likely  to  be  propagated  towards  analyses  at  larger 
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scales.  Due  to  this  effect,  some  results  become  impossible,  e.g.,  in  scenarios 
where  one  wants  to  assess  spatial  autocorrelation  at  some  large  scale  that  is 
influenced  by  highly  autocorrelated  observations  from  smaller  scales.  If  there 
is  spatial  autocorrelation  present  at  some  small  scale  (e.g.  one  “heavy”  Twitter 
user  recurrently  posting  from  a particular  location),  it  will  be  carried  through 
to  all  larger  scales  being  observed  in  the  same  geographic  neighborhood. 

Further  discussion  of  these  and  related  problems  (including  some  empirical 
results)  can  be  found  in  Westerholt  et  al  (2015)  (including  a discussion  of  a 
multi-scale  modification  of  the  local  G statistic)  and  Lovelace  et  al.  (2016).  The 
presented  list  of  effects  is  of  course  not  exhaustive.  There  might  be  many  more 
effects,  some  of  which  are  still  about  to  be  discovered.  The  subsequent  section 
provides  some  hints  and  recommendations  about  how  to  precede  with  the  spa- 
tial analysis  of  social  media  data. 


Some  recommendations 

Spatial  autocorrelation  is  the  core  principle  underlying  a great  deal  of  spatial 
analysis  methodology.  Therefore,  it  is  crucial  to  accurately  assess  this  charac- 
teristic in  order  to  design  applicable  methods,  and  for  drawing  reasonable  geo- 
graphic conclusions.  This  is  not  just  important  for  exploratory  tests  on  spatial 
clustering  and  heterogeneity,  but  also  crucial  for  model-driven  spatial  regres- 
sion scenarios  such  as  Geographically  Weighted  Regression  (GWR)  (Fother- 
ingham  et  al.  2003)  and  for  assessing  model  misspecification  (Cliff  and  Ord 
1981).  Unfortunately,  in  case  of  social  media  analysis,  the  assessment  of  spatial 
autocorrelation  is  strongly  affected  by  the  problems  depicted  in  the  previous 
section.  Therefore,  one  recommendation  in  terms  of  future  research  is  to  work 
on  appropriate  adaptations  of  corresponding  measures  and  techniques  in  order 
to  account  for  multi-scale  (or  rather:  “mixed- scale”)  and  multi- categorical 
effects.  As  long  as  these  are  not  available,  one  should  carefully  parameterize 
respective  techniques.  Another  (aspatial)  approach  might  be  to  decompose 
social  media  datasets  a priori,  probably  based  on  some  other  characteristic 
such  as  the  Tweets  semantics.  The  worst  option  of  all,  however,  would  be  to 
neglect  the  specific  spatial  characteristics  of  social  media  data  when  conduct- 
ing spatial  analysis.  That  would  lead  to  a wrong  evaluation  of  spatial  effects;  and 
thus  to  wrong  analysis  results. 

Another  recommendation  is  related  to  one  of  the  promising  opportunities 
that  come  with  social  media  datasets:  their  wealth  of  information.  We  can 
obtain  an  array  of  valuable  and  potentially  interrelated  properties  from  social 
media  data.  These  include  temporal,  semantic  and  spatial  information.  Cor- 
respondingly, one  should  try  to  analyze  all  these  dimensions  simultaneously 
instead  of  considering  them  in  a separated  fashion.  This  might  unveil  a much 
deeper  understanding  of  social  phenomena  that  are  reflected  in  such  datasets. 
Recent  research  efforts  like,  e.g.,  Steiger  et  al.  (2015)  reflect  this  idea.  However, 
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it  yet  remains  a challenge  to  find  measures  to  incorporate  these  different  kinds 
of  information  in  joint  methodology  in  a reasonable  way 


Conclusion  and  an  outlook  on  future  work 

We  outlined  some  potential  pitfalls  when  analyzing  social  media  data  spatially 
These  are  caused  by  the  inherent  characteristics  of  the  data,  i.e.,  the  way  in 
which  the  data  is  collected  and  what  such  services  are  used  for.  Potential  prob- 
lems include  geometric  mixtures  of  differently  scaled  data;  semantic  mixtures 
that  get  blurred  in  joint  attributes  derived  from  the  data;  and  (more  generally) 
spurious  assessments  of  spatial  correlations  and  thus  pattern  in  the  data. 

The  previous  paragraphs  are  clearly  biased  towards  the  concept  of  spatial 
autocorrelation.  On  the  one  hand  this  focus  is  due  to  the  research  focus  of  the 
authors.  On  the  other  hand  this  is  due  to  the  central  role  which  spatial  autocor- 
relation plays  throughout  the  entire  field  of  spatial  analysis.  However,  there  are 
of  course  other  important  characteristics  and  pitfalls  that  might  also  influence 
the  spatial  analysis  of  social  media  data.  The  observations  come,  for  instance, 
with  considerable  uncertainties  with  respect  to  relevant  dimensions:  The  text 
snippets  are  colloquial  and  oftentimes  difficult  to  interpret  (semantics),  the 
time  stamp  is  sometimes  not  in  line  with  real-world  happenings  (temporal)  and 
the  geographic  coordinates  are  prone  to  positioning  inaccuracies  (spatial).  The 
intensities  of  all  these  uncertainties  appear  to  be  varying  across  different  users, 
devices,  regions,  etc.  All  these  uncertainties  indeed  have  impact  on  the  results 
of  spatial  analysis. 

Future  methodological  research  should  focus  on  the  specific  spatial  char- 
acteristics of  social  media  data  (that  are  not  yet  known  to  a full  extent).  For 
now,  across  all  disciplines  and  domains,  it  is  common  sense  to  apply  established 
standard  methodology  to  social  media  data.  Relatively  little  emphasis  is  put  on 
purely  methodological  research  on  the  background  of  the  special  characteris- 
tics of  these  datasets.  Thus,  there  is  still  plenty  of  room  for  improvement.  The 
discipline  of  GIScience  could  play  a vital  role  in  these  developments.  Beyond 
purely  empirical  research,  the  impact  of  the  spatial  disciplines  has  been  quite 
small  so  far.  However,  given  that  many  research  questions  around  social  media 
are  distinctive  spatial  ones,  we  should  put  much  more  emphasis  on  specialized 
spatial  analysis  techniques  for  social  media. 


Conclusion 

On  the  one  hand,  social  media  data  offers  an  array  of  new  perspectives  regard- 
ing many  research  questions  and  applications.  On  the  other  hand,  however, 
these  datasets  also  come  with  a set  of  issues  that  need  to  be  taken  into  account, 
in  particular  when  it  comes  to  spatial  analysis.  GIScience  can  contribute  to  the 
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development  of  new  spatial  analysis  methods  for  social  media  data.  Current 
major  issues  from  a GIScience  perspective  include: 

• the  need  of  spatial  analysis  methods  to  be  adapted  towards  uncertain  and 
unstructured  data  types  from  LBSN; 

• the  handling  of  geographic  scale  effects  when  analyzing  social  media  data; 

• the  need  for  combining  different  methods  across  disciplinary  boundaries 
(e.g.  social  network  analysis,  semantic  analysis,  spatiotemporal  analysis),  in 
order  to  better  utilize  all  available  information  dimensions; 

• the  development  of  data  fusion  and  information  extraction  methods  that 
take  several  different  data  sources  simultaneously  into  account. 

This  would  support  exploring  latent  patterns  and  sensing  geographical  pro- 
cesses from  social  media  data  in  a more  realistic  manner.  GIScience  could  thus 
contribute  to  answering  these  important  geographic  questions  and  may  play  a 
major  role  in  the  further  exploration  of  social  media  data. 
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Abstract 

During  the  last  few  decades  the  role  of  citizens  in  environmental  monitoring 
has  changed  remarkably  in  Finland.  In  this  chapter,  we  briefly  describe  this 
change  by  using  examples  of  both  traditional  and  modern  monitoring  sys- 
tems. According  to  our  findings,  there  are  at  least  four  important  drivers  chal- 
lenging traditional  monitoring  systems.  First,  the  monitoring  is  undergoing 
a rapid  process  of  globalisation  and  e.g.  the  systems  that  earlier  focused  on 
national  problems  are  today  controlled  by  European  legislation  or  influenced 
by  international  problems,  agreements  and  practices.  Second,  public  obliga- 
tions for  monitoring  have  grown  much  more  rapidly  than  economic  resources 
and  it  requires  the  monitoring  systems  to  have  a new  kind  of  ability  to  adapt  to 
changes.  Third,  the  migration  of  people  from  rural  areas  to  towns  has  reduced 
the  potential  of  a voluntary  workforce.  The  forth  driver  is  the  aging  of  the  vol- 
unteers. All  drivers,  without  new  monitoring  strategies,  challenge  both  the  per- 
formance and  geographical  coverage  of  monitoring  systems.  We  expect  that 
a combination  of  new  technologies,  such  as  remote  sensing,  the  Internet  of 
Things  and  Big  Data,  can  empower  new  groups  of  volunteers  and  increase  the 
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social  impact  and  effectiveness  of  voluntary  monitoring  to  fulfil  our  national 
and  international  obligations. 
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Introduction 

Monitoring  provides  a sufficient  level  of  information  to  support  decision- 
making and  comply  with  legal  requirements.  Volunteers  are  one  of  the  greatest 
resources  for  enforcing  environmental  laws  and  regulations.  Their  role  is  ever- 
changing  as  there  are  more  obligations  than  ever  before  and  fewer  resources. 
There  is  a growing  interest  to  motivate  volunteers  and  increase  the  efficiency 
and  social  impact  of  monitoring. 

The  European  Union  has  rapidly  expanded  and  many  new  directives  have 
come  into  force  influencing  environmental  monitoring.  The  EU  legislation  has 
had  an  impact  on  the  national  legislation  of  all  Member  States  and  the  daily 
lives  of  people  living  in  the  EU.  Finland  joined  the  EU  in  1995,  and  today  direc- 
tives are  influencing  monitoring  and  their  data  dissemination  practices.  New 
monitoring  projects,  such  as  beach  littering,  have  been  established,  and  global 
obligations  require  the  implementation  of  the  INSPIRE  Directive,  which  is 
based  on  the  infrastructures  for  spatial  information. 

In  the  past  ten  years  we  have  seen  an  immense  increase  in  the  number  and  vol- 
ume of  citizen  science  projects.  The  Internet,  sensor  technology  and  smart  phones 
have  made  it  easy  to  record  observations  with  stamps  on  position  and  time,  and 
the  communication  with  data  has  become  quick  and  easy.  However,  the  partici- 
pation of  non- scientists  in  scientific  research  and  data  collection  is  not  a new 
phenomenon.  Well  into  the  19th  century  it  was  possible  for  non-professional  sci- 
entists to  contribute  remarkably  to  scientific  research  (Haklay  2015).  Significant 
scientific  figures,  such  as  Charles  Darwin  or  Robert  Boyle,  may  be  considered,  by 
todays  standards,  to  have  been  non-professional  scientists  (Shapin  1994). 

Urbanisation  began  relatively  late  in  Finland  and  the  process  has  been  more 
rapid  than  in  other  European  countries  (Heikkila  2003).  The  migration  of  peo- 
ple from  rural  areas  to  towns  has  reduced  the  potential  of  a voluntary  work- 
force. Aging  is  another  challenge,  with  the  number  of  people  over  60  being 
over  four  times  higher  than  it  was  in  1900  (Official  Statistics  of  Finland  2016). 
A whole  new  set  of  options  have  been  used  to  solve  challenges.  The  hydrologi- 
cal monitoring  service  is  currently  using  new  technologies  and  has  increased 
marketing  effort  to  recruit  new  observers.  Modern  monitoring  projects,  such 
as  the  algal  watch,  the  Lake&Seawiki  and  jellyfish,  have  utilised  new  mobile 
sensor  technologies  as  well  as  social  networking  and  processes  to  mitigate  the 
impacts  of  drivers.  Different  monitoring  projects  seem  to  have  different  domi- 
nating drivers,  as  shown  in  the  following  table. 
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Globalisation 

Grown 

obligations 

Urbanisation 

Ageing 

Hydrology 

0 

0 

- 

- 

Birds 

+ 

+ 

0 

- 

Game  animals 

0 

+ 

- 

- 

Algal  watch 

+ 

+ 

0 

0 

Lake&Seawiki 

+ 

+ 

0 

0 

Jellyfish 

0 

0 

0 

0 

Beach  litter 

+ 

+ 

0 

0 

Table  1:  Impacts  of  different  drivers  (+,  0,  positive,  neutral,  negative)  on 
selected  monitoring  projects  in  Finland. 


Selected  monitoring  projects 

Hydrological  monitoring  - In  service  of  infrastructure 

Local  volunteers  were  actively  recruited  into  hydrological  monitoring  already 
in  the  beginning  of  the  20th  century  when  the  predecessor  of  the  national 
hydrological  service  in  Finland  was  initiated.  The  observers  received  small 
premiums  for  their  services,  but  their  key  motivation  was  their  own  interest 
in  hydrological  phenomena  (Kuusisto  2008a).  For  more  than  a hundred  years 
these  amateur  station  agents  completed  important  observations  on  hydrologi- 
cal parameters  such  as  water  level,  frost,  snow  thickness,  water  equivalent  and 
ice  cover. 

Observers  have  usually  been  engaged  with  monitoring  activities  for  a long 
period  of  time,  with  some  monitoring  sites  having  been  managed  by  the  same 
family  since  the  1910s.  In  the  beginning,  the  number  of  observed  parameters 
was  large.  Between  1913  and  1931,  the  volunteers  also  collected  samples  on 
water  quality.  There  were  200  stations  for  the  analysis  of  transparency  and  light 
attenuation  coefficients  at  seven  different  wave  lengths  (Kuusisto  2008b),  and 
100  locations  in  which  samples  for  organic  and  inorganic  suspended  sedi- 
ment, dissolved  inorganic  and  organic  matters,  alkalinity  and  dissolved  oxygen 
demand  were  collected.  The  intensive  programme  continued  until  1932,  when 
the  Great  Depression  forced  the  closure  of  many  monitoring  programmes. 
However,  the  hydrological  network  started  after  the  Second  World  War. 

The  national  hydrological  monitoring  is  an  example  of  the  volunteers  being 
very  efficiently  organised.  From  the  beginning  the  system  has  taken  into  account 
the  role  of  volunteers  as  a part  of  the  entire  system.  Today,  a remarkable  part  of 
the  hydrological  monitoring  is  automated,  but  the  network  of  over  300  observers 
covers  the  entire  country  and  is  closely  developed  to  support  real-time  forecast- 
ing systems  (Ymparisto  2015).  Trained  observers  take  care  of  the  pre-designed 
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network  of  monitoring  sites.  This  makes  the  system  reliable  and  cost-efficient 
even  in  remote  areas.  The  system  is  facing  a great  challenge  as  many  of  the 
elderly  volunteers  have  stepped  down  and  more  marketing  efforts  are  needed  in 
order  to  recruit  new  amateurs  to  replace  them,  especially  in  rural  areas. 


Birds  - Academic  interest  and  hobby 

In  Finland  regular  bird  monitoring  started  in  the  early  20th  century.  Though 
the  main  motivation  for  this  was  academic,  safeguarding  some  of  the  dimin- 
ishing bird  populations  was  another  goal  as  well.  Up  to  the  1970s,  voluntary 
birdwatchers  were  mainly  skilled  amateur  naturalists  living  in  urban  areas. 
A central  organisation  for  regional  societies  was  established  in  the  middle  of 
the  1970s  and  the  Bird  Atlas  project  started  activating  birdwatchers  in  rural 
areas  (Santaoja  2013).  The  central  organisation  joined  the  global  network  of 
bird  organisations,  BirdLife  International,  in  1992.  Since  then  the  number  of 
birdwatchers  has  increased  remarkably  and  today  involves  about  12,000  bird- 
watchers (Birdlife  2015). 

Until  the  1970s,  the  partnership  between  the  birdwatchers  and  the  Museum 
(Natural  History  Museum  of  Finland)  was  very  tight.  Since  then  the  Museum 
has  concentrated  on  programmed,  traditional  bird  monitoring  and  left  the  col- 
lection and  filing  of  other  voluntary  observations  to  the  NGOs  (regional  orni- 
thological societies  and  BirdLife  Finland.  They  have  maintained  and  collected 
the  data  since  then  and  share  it  in  the  BirdLife  database,  which  is  open  to  all  reg- 
istered users.  Both  programmed  monitoring  and  other  data  are  collected  from 
birdwatchers. 

The  majority  of  the  official  bird  monitoring  is  now  coordinated  by  the 
Museum.  It  is  based  on  the  work  of  voluntary  birdwatchers  who  have  commit- 
ted to  a regular  monitoring,  following  the  given  instructions  on  when,  where 
and  how  to  monitor  certain  bird  species  or  an  area.  The  monitoring  data  is 
used  in  the  European  Bird  Census  Councils  (EBCC)  various  projects,  e.g.  in 
the  assessment  of  Pan-European  Common  Bird  Indices.  The  monitoring  data 
and  the  random  observations  of  birdwatchers  are  made  available  as  part  of 
the  Global  Biodiversity  Information  Facility  (GBIF),  which  aims  to  make  the 
world  s scientific  biodiversity  data  freely  and  universally  available  via  the  Inter- 
net for  the  benefit  of  science,  society  and  a sustainable  future. 

The  reasons  for  increasing  birdwatching  are  manifold.  Today  the  literature 
and  other  methods  for  the  identification  of  species  are  very  developed.  Besides 
strengthening  regional  societies,  BirdLife  has  been  active  in  publishing  and 
communications  and  it  has  introduced  a large  set  of  new  activities,  such  as 
the  Big  Garden  Birdwatch  in  January,  which  is  aimed  for  all  citizens,  and  a 
competition  called  the  Battle  of  the  Bird  Towers  in  May,  which  is  a competition 
mainly  for  birdwatchers.  The  share  of  women  has  grown,  and  bird  watching  is 
nowadays  also  a hobby  for  families.  The  current  number  of  bird  observations 
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collected  by  BirdLife  and  its  member  societies  is  annually  over  1 million  in  Fin- 
land. Bird  monitoring  is  a global  programme  that  has  also  managed  to  attract 
many  observers  from  rural  areas. 

Game  animals  - Hobby  and  co-management 

Official  monitoring  of  the  population  abundance  of  game  has  been  carried  out 
in  cooperation  between  the  Ministry  of  Agriculture  and  Forestry,  the  Finn- 
ish Game  and  Fisheries  Research  Institute  and  hunter  organisations  since  the 
1970s.  The  objective  of  the  data  collection  is  to  produce  a scientific  foundation 
for  sustainable  hunting.  Researchers  plan  the  census  and  organise  it  together 
with  hunter  organisations,  volunteer  hunters  do  the  actual  fieldwork  (Rktl 
2015a)  and  the  public  research  institute  conducts  the  analyses  and  reports  to 
the  national  and  local  administrations. 

Different  game  species  groups  have  their  own  monitoring  programmes.  For- 
est and  mixed-forest  agricultural  game  species  are  monitored  through  wildlife 
triangle  schemes.  More  than  30  forest  game  species  are  monitored.  The  popula- 
tion and  breeding  success  of  waterfowl  are  monitored  with  pair  counts  in  May 
and  brood  counts  in  July.  Specific  census  methods  have  been  developed  for 
moose,  large  carnivores,  seals,  beavers  and  wild  forest  reindeer. 

It  is  important  to  note  that  in  Finland  about  300,000  citizens  pay  the  annual 
hunting  management  fee,  and  the  number  has  doubled  between  1960  and  2000 
(Saarsalmi  et  al.  2014).  In  1960  a majority  of  the  hunters  owned  their  hunting 
land,  but  today  60  per  cent  hunt  on  rented  land  of  their  hunting  club  or  on  gov- 
ernment land.  A large  proportion  of  the  landless  hunters  live  in  towns.  Accord- 
ing to  the  enquiry  (Rktl  2015b)  the  total  amount  of  active  voluntary  work  by 
hunters  in  Finland  in  2008  comprised  290  man-years.  Hunters  were  on  standby 
for  a total  of  1,800  man-years  to  assist  in  moose,  white-tailed  deer  and  large 
game  animal  emergencies,  like  traffic  accidents. 

An  estimated  total  of  40,000  hunters  participated  in  voluntary  work  in  2008. 
The  value  of  the  voluntary  work  without  overheads  was  estimated  at  7. 1 million 
euros.  The  Game  Management  Associations  estimated  that  some  20,000  people 
performed  voluntary  work  in  game  monitoring.  Voluntary  work  in  nationwide 
game  monitoring  schemes  was  estimated  to  be  89  man-years,  of  which  obser- 
vation of  large  carnivores  made  up  40  man-years.  Hunters  covered  around 
900,000  km  with  their  cars  to  carry  out  these  nationwide  game  monitoring 
schemes.  Even  though  there  have  been  more  obligations  for  volunteers  in  Fin- 
land, these  challenges  have  been  overcome  via  good  cooperation. 


Algal  Watch  - Supporting  other  sources  ofinformation 

Algal  watch  was  initiated  in  1998  to  better  inform  the  public  about  blue-green 
algal  blooms  in  the  Northern  Baltic  Sea.  In  2000,  the  first  group  of  trained 
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volunteers,  sea  scouts,  started  monitoring  coastal  and  archipelago  areas  in 
Southern  Finland  (Rapala  et  al.  2012).  The  work  supplemented  the  data  col- 
lected by  commercial  ferries  in  the  Alg@line  network  (Finmari  2015)  estab- 
lished in  1995  (Rantajarvi  et  al.  2003).  The  aim  of  the  network  was  partly 
educational,  but  it  also  set  in  a practice  for  citizen  monitoring  on  blue-green 
algal  blooms,  bladder  wrack  and  water  transparency.  Later  on,  the  system  was 
extended  and  since  2011  citizens  have  been  able  to  report  their  observations 
on  blue-green  algal  blooms,  bladder  wrack  density  and  Secchi  depth  by  using  a 
mobile  phone  application  called  Algae  watch  (Mmea  2015). 

The  application  includes  instructions  and  stamping,  i.e.  registration  of  posi- 
tion and  timing  of  the  observation  is  done  by  the  GPS  of  the  phone.  It  is  also 
possible  to  take  a photograph  and  send  it  simultaneously  with  the  observation. 
During  the  performed  pilot  trials,  no  service  misuse  was  detected  (Kotovirta 
et  al.  2014). 

Users  of  the  application  were  motivated  by  sending  them  a notification  of  the 
algae  situation  and  reminders  to  contribute  to  observations  in  the  future  too. 
From  the  user  data  it  can  be  seen  that  citizen  activity  decreased  towards  the  end 
of  the  summer,  although  blooms  were  still  present.  This  is  most  likely  due  to  the 
timing  of  summer  holidays  in  Finland,  which  usually  end  by  the  beginning  of 
August  (Kotovirta  et  al.  2014). 


Lake&Seawiki  - modern  tools  and  social  networking 

Lake&Seawiki  (Jarviwiki  2015)  is  a wiki  service  about  Finnish  lakes  and  coastal 
sea  areas.  The  concept  was  developed  in  the  Finnish  Environment  Institute 
(SYKE)  to  promote  peoples  engagement  in  the  protection  and  monitoring  of 
their  nearby  waters  and  to  allow  non-professionals  to  upload  observations  on 
water  temperature,  ice  situation,  algal  blooms  etc.  The  service  has  been  running 
since  2011  and  anyone  can  contribute.  The  users  of  the  service  can  take  part  in 
discussions  and  maintain  their  own  observation  sites.  Service  is  well  marketed 
and  currently  the  first  that  is  managed  by  the  communication  department  of 
SYKE. 

The  service  is  running  on  open  source  software  and  has  low  operation  costs. 
The  moderation  and  upgrades  of  software  require  one  person-month  a year. 
Furthermore,  an  office  hour  helpdesk  is  needed  in  June- August.  In  2014, 
almost  280,000  users  visited  the  service,  with  200  contributing  during  sum- 
mer months  and  30  during  winter  months.  They  produce  more  than  9,000 
observations  annually.  The  number  of  visitors  has  increased  annually  by  25% 
(Kettunen  et  al.  2014). 

Still,  a large  part  of  people  contributing  to  the  service  are  citizens  who  have 
earlier  been  recording  their  observations  in  their  private  notebooks.  The  ser- 
vice has  given  them  a platform  and  acts  as  an  archive  and  a visualizer  for  their 
observations.  Some  time  series  clearly  show  the  impact  of  climate  change  on 
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ice  breakup.  Measurements  received  from  experienced  amateur  observers  are 
generally  of  good  quality  and  thus  complement  nicely  the  other  parts  of  the 
monitoring  system.  The  service  started  receiving  observations  from  desktop 
computers.  Today,  the  share  of  mobile  phone  and  tablet  users  is  growing  and  is 
expected  to  speed  up  the  growth  of  the  service. 


Jellyfish  - Abundance  unveiled  for  the  1st  time 

In  2010  the  Finnish  Environment  Institute  started  to  collect  jellyfish  observa- 
tions from  the  public  to  determine  jellyfish  distribution  in  Finnish  waters  and 
the  factors  affecting  the  bloom  formation  in  late  summer.  The  proposal  for 
the  monitoring  came  from  the  energy  industry,  which  needed  data  on  jelly- 
fish abundances  and  predictions  on  conditions  in  which  jellyfish  form  blooms, 
since  these  can  affect  power  plants  by  possibly  clogging  the  cooling  water 
intake  pipes. 

Using  only  public  observations,  the  monitoring  of  jellyfish  distribution  and 
abundance  has  now  been  ongoing  for  5 years.  Citizens  have  reported  their 
observations  via  a web  form,  which  is  planned  to  be  developed  into  a mobile 
application  in  the  near  future.  In  the  form  citizens  are  asked  to  estimate  the  jel- 
lyfish abundance  on  a two-level  scale  (few,  clearly  less  than  <20  individuals  m 2 
or  a bloom,  >20  individuals  nr2),  name  the  sea  area  where  the  observation  was 
done,  and  if  it  was  a coastal  or  an  open  sea  observation.  Wind  and  temperature 
estimates  are  also  asked  in  the  form. 

This  voluntary  citizen  science  monitoring  has  revealed  patterns  in  jellyfish 
abundance  and  distribution  range  in  Finnish  waters  which  would  have  other- 
wise been  impossible  to  obtain.  Further,  the  reported  observations  have  also 
been  used  as  a service  when  communicating  with  the  power  plants  and  other 
industry  on  the  blooms  close  to  their  seawater  intake  areas.  The  challenge  of 
this  kind  of  monitoring  is  that  it  is  highly  dependent  on  the  press  releases 
informing  citizens  that  observations  are  still  (and  continuously)  needed.  With- 
out advertisements  the  number  of  observations  is  much  smaller. 


Beach  litter  - Global  outsourced  monitoring 

Beach  litter  monitoring  with  the  help  of  citizens  started  in  Finland  in  2012, 
when  groups  from  Sweden,  Finland,  Estonia  and  Lithuania  became  partners  in 
an  EU-funded  project.  The  aim  was  to  implement  the  harmonised  method  for 
the  first  time  around  the  Baltic  Sea.  The  method  used  was  based  on  a slightly 
modified  UNEP  protocol  on  beach  litter  survey  (Cheshire,  Adler  & Barbiere 
2009).  The  length  of  the  beach  had  to  be  adjusted  to  central  Baltic  conditions, 
as  well  as  the  timing  of  surveys.  The  survey  was  implemented  by  the  Finnish 
partner  in  the  project:  a local  NGO  (Keep  the  Archipelago  Clean). 
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For  the  survey  to  be  successful,  there  has  to  be  a dedicated  group  of  people 
(at  least  two,  but  no  more  than  10)  responsible  for  the  survey  during  a certain 
time  of  the  year  on  a certain  beach.  Each  group  has  a contact  person  who  col- 
lects the  data  and  delivers  it  to  the  NGO  in  charge  of  the  survey  In  Finland  the 
contact  persons  were  people  such  as  school  teachers.  Experience  from  the  field 
has  shown  that  the  best  commitment  has  come  from  schools,  where  the  survey 
can  be  included  as  a part  of  environmental  education. 

The  survey  was  a success  in  many  ways,  and  the  project  was  able  to  combine 
comparable  data  from  macroscopic  litter  in  different  Baltic  countries.  The  pro- 
ject received  a lot  of  attention  in  national  media,  partly  because  Finland  was  cat- 
egorised as  the  most  littered  country  of  the  project.  Some  need  of  development/ 
improvement  was  also  noted  during  the  survey  by  the  organiser.  The  type  and 
location  of  the  beach  that  will  be  surveyed  has  to  be  very  carefully  chosen.  It 
is  especially  important  when  comparisons  between  countries  are  made.  Each 
geographical  area  should  have  representative  beaches  from  all  beach  categories: 
rural,  urban  or  in  between.  It  is  especially  important  to  identify  what  pressures 
are  causing  littering  on  the  survey  beach  so  that  management  is  targeted  cor- 
rectly. Geographical  expertise  combined  with  local  information  of  water  cur- 
rents, upwelling  areas  and  other  hydrographical  aspects  that  may  have  an  effect 
on  the  distribution  of  litter  should  also  be  included  in  the  planning  phase. 

By  using  local  citizens,  the  beach  litter  survey  has  proven  to  work  so  well 
that  it  is  presently  included  in  the  Finnish  monitoring  plan  for  beach  litter.  The 
collaboration  with  the  NGO  is  continuing,  and  new  areas  for  monitoring  are 
planned  together  with  authorities. 


Lessons  learned 
Society  under  change 

There  are  some  drivers  in  society  that  have  had  a strong  influence  on  monitor- 
ing during  the  last  few  decades.  One  is  urbanization,  i.e.  migration  of  people 
from  rural  areas  to  population  centres.  In  Finland,  this  started  as  late  as  the 
1960s,  but  the  impact  was  stronger  with  the  delay.  Together  with  the  rapid  age- 
ing of  the  Finnish  population,  it  has  gradually  diminished  the  potential  of  get- 
ting new  local  volunteers  for  traditional  hydrological  and  game  animal  moni- 
toring. The  lack  of  voluntary  labour  has  worsened  the  situation  with  regards 
to  tasks  that  require  a constant  standby  or  presence  near  the  rural  observa- 
tion sites.  Contrary  to  these  continuous  monitoring  tasks,  short-term  moni- 
toring efforts,  such  as  wildlife  triangle  schemes,  have  revived  again  after  years 
of  downturn.  This  is  explained  by  the  intensive  campaigns  of  hunter  organi- 
sations. Short-term  tasks  are  also  better  suited  to  the  urban  life-rhythm.  The 
recent  digitalisation  of  information  systems  has  also  made  it  possible  to  base 
the  regulation  of  hunting  on  real-time  data,  which  has  encouraged  voluntary 
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monitoring  work.  Unlike  the  other  traditional  monitoring  systems,  birdwatch- 
ing schemes  seem  to  have  benefitted  from  urbanisation.  This  has  mainly  fol- 
lowed from  the  unification  and  decentralisation  of  birdwatch  organisations  in 
the  mid’  1970s  (Santaoja  2013).  The  organisations  have  also  been  able  to  tackle 
the  problem  of  ageing  by  developing  new  types  of  operational  models  based  on 
ideas  of  competitions  and  other  kinds  of  gamification,  for  example.  Also,  the 
educational  and  communicational  material  and  tools  supporting  the  birdwatch 
have  greatly  improved. 


Geographical  scale  and  stage  in  policy  formation  have  changed 

For  the  past  20  years,  a strong  driver  changing  both  the  ecological  and  envi- 
ronmental monitoring  has  been  globalisation.  It  has  changed  the  geographical 
scale  of  monitoring.  After  Finland  joined  the  EU,  the  top-down  regulations  and 
the  number  of  legal  obligations  have  grown  tremendously.  Between  2000  and 
2015,  the  existing  ecological  and  environmental  monitoring  was  redesigned  to 
be  compatible  with  EU  regulations,  which  has  also  somewhat  changed  the  citi- 
zen science  in  Finland.  After  2008,  however,  the  national  economy  has  weak- 
ened and  the  environmental  administration  has  been  forced  to  reduce  costs 
and  reconsider  the  entire  monitoring  system.  In  the  latest  strategy,  the  Ministry 
of  Environment  (2011)  has  indicated  new  guidelines  for  environmental  moni- 
toring in  2020.  According  to  the  guidelines,  the  imperatives  are  to  reduce  costs 
while  improving  timeliness  and  usability.  The  tools  suggested  by  the  strategy 
are  automation  and  digitalisation,  remote  sensing,  increased  use  of  applications 
of  citizen  science  and  increased  co-operation  between  the  public,  private  and 
voluntary  sectors. 


Operational  models  and  the  depth  of  engagement  are  different 

Before,  the  traditional  monitoring  systems  were  top-down  oriented.  The  vol- 
unteers were  given  a task  to  collect  observations,  briefed  on  the  phenomenon 
and  given  forms  to  fill.  Today,  the  operational  models  are  more  diverse  and  the 
depth  of  the  engagement  of  volunteers  varies  (Haklay  2015).  We  see  it  in  the 
birdwatch  and  game  monitoring.  Competitions  and  campaigns  have  remark- 
ably increased  the  participation.  However,  the  activity  of  the  hobbyists  is  not 
constant.  For  example,  during  2006-2010  57  per  cent  of  the  1.2  million  bird 
observations  were  made  by  100  so-called  superobservers.  The  first  years  of 
Lake&Seawiki  have  shown  that  people  are  most  active  in  making  observations 
and  participating  in  wiki  discussion  while  on  vacation.  In  the  beach  litter  watch 
we  have  seen  that  volunteers  easily  grow  tired  of  observing  if  they  do  not  have 
some  additional  motive. 

All  monitoring  systems  and  the  citizen  science  in  them  have  their  own  life 
cycles.  It  seems  probable  that  hydrological  monitoring  in  its  traditional  form 
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will  be  substituted  by  automation.  However,  it  seems  just  as  probable  that 
Lake&Seawiki  has  also  brought  new  means  for  extending  the  geographical  cov- 
erage of  snow  and  ice  observations.  Earlier,  we  designed  our  monitoring  on  a 
national  basis.  Our  new  systems  are  more  generic  and  can  also  be  taken  into 
use  internationally. 
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Abstract 

Land-related  inventories  are  important  sources  of  geoinformation  for  environ- 
mentalists, researchers,  policy-makers,  practitioners,  and  ecologists.  Tradition- 
ally, a considerable  amount  of  energy,  time,  and  money  have  been  dedicated 
to  map  global/regional/local  land  use  datasets.  While  remote  sensing  images 
and  techniques  along  with  field  surveying  have  been  the  main  sources  of  data 
for  determining  land  use  features,  field  measurements  of  ground  truth  have 
always  amplified  the  required  time  and  money,  as  well  as  information  credibil- 
ity. Nowadays,  volunteered  geographic  information  (VGI)  has  shown  its  great 
contributions  to  different  scientific  disciplines.  This  was  made  possible  thanks 
to  Web  2.0  technologies  and  GPS-enabled  devices,  which  have  advanced  citi- 
zens knowledge-based  projects  and  made  them  user-friendly  for  volunteered 
citizens  to  collect  and  share  their  knowledge  about  geographical  objects.  Open- 
StreetMap  as  one  of  those  leading  VGI  projects  has  shown  its  great  potential 
for  collecting  and  providing  land  use  information.  The  collaboratively  collected 
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land  use  features  from  diverse  citizens  could  greatly  back  up  the  challenging 
element  of  land  use  mapping,  which  is  in-field  data  gathering.  Hence,  in  this 
literature  we  will  look  at  the  completeness,  thematic  accuracy  and  fitness  for 
use  of  OpenStreetMap  features  for  land  mapping  purposes  over  European 
countries.  The  empirical  findings  reveal  that  the  degree  of  completeness  varies 
widely  ranging  from  2%  to  96%  and  overall  and  per-class  thematic  accuracies 
goes  up  to  80%  and  96%,  respectively  compared  to  the  European  GMESUA 
datasets.  Furthermore,  more  than  50%  of  land  use  features  of  eight  European 
countries  are  mapped.  This  messages  that  the  harnessing  citizens  knowledge 
can  play  a great  role  in  land  mapping  as  an  alternative  and  complementary  data 
source. 


Keywords 

Land  use  mapping;  Comparative  assessment;  Global  Monitoring  for  Environ- 
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Introduction 

Land  cover  (LC)  and  land  use  (LU)  inventories  contain  geoinformation  on  the 
coverage  and  usage  of  our  surrounding  lands,  respectively.  LU  and  LC  inven- 
tories are  of  high  importance  for  many  applications  with  regards  to  urban  and 
regional  planning,  policy  making,  among  others.  These  two  concepts  present 
two  distinctive  concepts,  because  LU  maps  explain  human  activities  happening 
on  the  land,  such  as  artificial  surface  construction,  farming,  and  forestry  that 
represent  the  usage  of  land  (Ellis  2007;  Wastfelt  & Arnberg  2013),  while  LC 
maps  present  the  physical  cover  on  the  ground  (De  Sherbinin  2002).  Tradition- 
ally, applying  image  processing  algorithms  on  remotely  sensed  data  elaborated 
with  ground-truth  measurements  and  other  complementary  archive  data  have 
been  the  main  source  of  collecting  LU  and  LC  features  (Qi,  Yeh,  Li  & Lin  2012; 
Saadat  et  al.  201 1).  Although  remote  sensing  images  and  techniques  often  facili- 
tate earth  observation  efforts,  in-field  surveying  as  well  as  personal  interviews 
with  local  residents  are  required  for  the  sake  of  results  validation,  i.e.  as  ground- 
truth  data  coming  from  in- situ  measurements  play  a critical  role  in  delivering 
end  products  (Cihlar  & Jansen  2001;  De  Leeuw  et  al.  2011).  Therefore,  we  have 
to  collect  ancillary  data  as  well  in  order  to  assign  appropriate  LU  types  to  land 
parcels.  As  a result,  LU  mapping  becomes  even  more  complicated  than  LC  map- 
ping, and  extensive  data  collection  from  local  citizens,  land  managers,  and  evi- 
dence sources  are  vital  for  accurate  LU  mapping  (Fritz  et  al.,  2012). 

From  financial  and  temporal  perspectives,  a great  deal  of  budget  and  time 
have  been  dedicated  for  producing  LU  and  LC  maps  at  global,  regional,  and 
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local  scales.  Examples  of  global  and  regional  scale  with  coarse  resolution  prod- 
ucts include  Global  Land  Cover  (GLC)-2000  (Fritz  et  al.,  2003),  Moderate- 
resolution  Imaging  Spectroradiometer  (MODIS  (Mclver  8c  Friedl,  2002)), 
and  GlobCover  (Arino  et  al.,  2012),  CORINE  2000  (Biittner,  Feranec,  8c 
Gabriel,  2002)  and  Global  Monitoring  for  Environment  and  Security  Urban 
Atlas  (GMESUA  (Seifert,  2009))  among  others.  In  the  case  of  GMESUA,  high- 
resolution  images  including  SPOT,  RapidEye,  and  ALOS  Images  have  been 
utilized  to  generate  fine-scale  maps  of  large  metropolitan  areas  delivering 
GMESUA  (Kong,  Yin,  Nakagoshi,  8c  James,  2012).  But,  the  accuracy  of  them 
has  been  the  main  concern  as  outlined  by  (Fritz  et  al.,  2012;  Herold,  Mayaux, 
Woodcock,  Baccini,  8c  Schmullius,  2008).  Thus,  the  necessity  of  having  an 
alternative  and  complementary  solution  for  mapping  LU  and  LC  features  is 
evident.  We  believe  that  VGI  could  be  of  great  importance,  because  the  devel- 
opment of  web  technologies  and  large  availability  of  GPS-enabled  devices  have 
resulted  in  the  emergence  of  a large  number  of  VGI  platforms,  which  provide 
information  about  geographical  objects  from  citizens  (Fonte,  Bastin,  See, 
Foody,  8c  Lupia,  2015).  The  majority  of  the  VGI-like  platforms  offer  very  high- 
resolution  satellite  and  aerial  images  (from  20  cm  spatial  resolution)  through 
image  libraries  (e.g.  Bing  Maps)  in  their  interfaces,  which  enable  volunteers  to 
visualize  the  whole  globe  with  high  detail  so  that  they  can  map  a large  variety 
of  features  and  attach  respective  attributes  to  them  (Rouse,  Bergeron,  8c  Harris, 
2007).  In  other  words,  a sort  of  visual  analysis  and  interpretation  of  satellite 
images  is  applied.  This  convenient  and  straightforward  way  of  visual  interpre- 
tation of  remote  sensing  images  can  be  considered  as  an  alternative  solution 
for  LU  mapping  and  even  achieving  finer  resolution  LU  maps  than  our  current 
stored  datasets  at  a global  scale  (Jokar  Arsanjani,  Mooney,  Helbich,  8c  Zipf, 
2015).  Undoubtedly,  OSM  has  been  a pioneer  example  of  VGI  and  has  shown 
its  huge  potential  for  being  the  Wikipedia  of  maps  exactly  as  its  motto.  OSM  is 
a unique  platform  for  several  reasons  namely,  it  has  attracted  a huge  amount 
of  public  attention  and  contributions  (Ramm  et  al.,  2011)  by  having  exceed- 
ing 2.3  million  users  until  today  and  continues  to  grow  as  outlined  by  Jokar 
Arsanjani,  Helbich,  et  al.  (2015).  More  importantly,  OSM  is  highly  democratic 
in  receiving  contributions  through  enabling  any  volunteer  to  add/edit/mod- 
ify  the  existing  features  and  sharing  the  whole  data  history  freely  and  openly 
with  the  public  in  a structured  way  (Flanagin  8c  Metzger,  2008;  Koukoletsos, 
Haklay,  8c  Ellul,  2012).  Moreover,  OSM  collects  geographic  information  in  the 
form  of  GIS  vector  data  such  as  points,  polylines,  and  polygons  and  releases 
them  based  on  different  tags,  which  makes  it  quite  user-friendly  for  end  users 
(Jokar  Arsanjani,  Helbich,  Bakillah,  Hagenauer,  8c  Zipf,  2013;  Jokar  Arsanjani, 
Mooney,  Helbich,  et  al.,  2015). 

An  extensive  amount  of  analysis  of  road  networks  in  OSM  has  been  carried 
out  (Ludwig,  Voss,  8c  Krause-Traudes,  2011;  Mooney  8c  Corcoran,  2012)  and  a 
few  attempts  in  analyzing  OSM  for  LU  mapping  has  been  conducted.  We  will 
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assess  the  role  of  OSM  in  LU  and  LC  mapping.  Besides  preparing  a LU  dataset 
from  OSM  contributions,  we  aim  at  a)  measuring  the  completeness  of  OSM 
LU  features,  b)  cross-comparing  the  thematic  accuracy  of  the  OSM  LU  features 
with  the  GMESUA  data  through  a statistical  assessment,  c)  assessing  the  fitness 
for  use  of  OSM  for  LU  and  LC  mapping. 


Materials 
OSM  dataset 

A snapshot  of  OSM  features  tagged  as  natural’  and  ‘landuse’  from  Novem- 
ber 2013  and  February  2014  was  collected.  The  features  tagged  with  natural’ 
describe  a wide  variety  of  physical  features,  which  are  categorized  into  different 
categories  such  as  water  bodies,  forest,  etc.  as  described  in  (Ramm,  2014).  The 
term  landuse’  concerns  the  human  use  of  land,  which  represents  the  purpose  a 
land  parcel  is  being  used  for. 


Reference  dataset 

In  this  study,  the  pan-European  GMESUA  dataset  serves  as  reference  data, 
which  comprises  LU  data  for  selected  metropolitan  areas  exceeding  100,000 
inhabitants.  It  is  prepared  for  European  needs  and  the  contained  informa- 
tion has  been  derived  mainly  from  Earth  Observation  (EO)  data  supported 
by  other  reference  data  including  commercial-off-the-shelf  (COTS)  naviga- 
tion data  and  topographic  maps.  It  has  a minimum  mapping  unit  (MMU)  of 
0.25-1  ha,  and  a minimum  width  of  linear  elements  of  100  m with  ± 5 m 
positional  accuracy  (European  Union,  2011).  It  currently  covers  305  urban 
regions  within  Europe.  The  minimum  thematic  accuracy  for  all  classes  is  80%. 
For  more  details  see  the  Urban  Atlas  mapping  guide  (European  Union,  2011). 
Table  1 represents  the  defined  classes  and  their  codes  in  GMESUA  at  different 
levels  of  details. 


Study  areas 

In  this  study,  the  whole  European  continent  was  chosen  as  the  study  area  for 
the  regional  scale  analysis  and  ten  random  metropolitan  areas  were  selected 
as  case  studies  for  the  local  scale  analysis.  These  cities  including  their  metro- 
politan areas  are  Berlin,  Frankfurt  am  Main,  Munich,  and  Hamburg,  Bucha- 
rest, Rome,  Stockholm,  London,  Budapest,  and  Vienna.  Having  multiple  case 
studies  from  different  countries  would  help  to  understand  the  heterogeneity  of 
contributions  in  terms  of  quantity  and  quality. 
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Table  1:  Classification  scheme  applied  in  the  preparation  of  GMESUA  datasets  as  outlined  in  European  Union  (2011). 
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Methods 

Quality  of  geodata  should  be  considered  internally  and  externally  (Gervais, 
Bedard,  Levesque,  Bernier,  & Devillers,  2009;  Jokar  Arsanjani,  Barron,  Bakil- 
lah,  Helbich,  & Arsanjani,  2013;  van  Oort,  2006).  Internal  quality  reflects  the 
specifications  in  the  process  of  data  production  that  address  errors  in  the  data. 
External  quality  measures  the  suitability  of  a dataset  for  a particular  purpose 
and  addresses  its  ‘Fitness  of  Use’  (FoU:  (Devillers,  Bedard,  Jeansoulin,  & Mou- 
lin, 2007;  Guptill  & Morrison,  1995)).  The  major  standard  organizations  (e.g. 
ISO,  ICA,  FGDC,  and  CEN)  have  described  their  main  criteria  for  data  quality 
analysis  and  the  following  five  criteria  are  common  amongst  them:  (1)  com- 
pleteness, (2)  positional  accuracy,  (3)  thematic  accuracy,  (4)  temporal  accuracy, 
and  (5)  logical  consistency  (Guptill  & Morrison,  1995).  In  this  study,  two  major 
aspects  of  internal  data  quality  namely  completeness  and  thematic  accuracy  are 
considered  and  their  external  use  is  discussed. 

Following  Figure  1,  first,  OSM  features  tagged  with  ‘landuse’  and  ‘natural’  are 
retrieved  and  merged  together  into  a unique  dataset.  Second,  overlaps  and  top- 
ological errors  in  the  dataset  are  then  resolved  to  assure  the  logical  consistency 
of  features.  Third,  the  OSM  features  are  re-classified  and  matched  according  to 
the  GMESUA  nomenclature.  Fourth,  the  percentage  of  completeness  for  each 
country/city  is  determined  to  measure  how  complete  a certain  city  is  mapped. 
Finally,  an  error  matrix  between  the  OSM  and  GMESUA  datasets  is  computed 
to  measure  the  overall  thematic  accuracy  of  the  OSM  features  along  with  a 
detailed  per-class  analysis. 


Results  and  discussions 
Completeness 
Regional  (European)  scale 

Figure  2 represents  the  measured  completeness  indices  across  European  coun- 
tries. This  is  calculated  based  on  the  total  mapped  area  in  each  country  relate  to 
total  area  of  the  corresponding  country.  The  values  are  diverse.  While  only  1.6% 
of  land  use  features  in  Iceland  are  mapped,  96%  of  Bosnia  and  Herzegovina  are 
mapped. 

More  than  half  of  Belgium,  Bosnia  & Herzegovina,  Germany,  France,  Lux- 
emburg, the  Netherlands,  Romania,  and  Slovakia  are  mapped.  Spatial  distribu- 
tion of  the  mapped  features  within  Europe  is  displayed  in  Figure  3 by  green 
cells.  It  should  be  noted  that  considering  European  countries  with  dissimilar 
population  and  physical  patterns,  these  completeness  values  should  not  be  used 
for  judging  the  topology  of  citizen  participations  in  OSM.  For  instance,  Ice- 
land with  an  area  of  103,000  km2  and  nearly  300,000  inhabitants  is  the  least 
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1:  The  flowchart  of  evaluating  OSM  land  use  features. 
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Figure  2:  The  calculated  completeness  index  of  OpenStreetMap  land  use  fea- 
tures for  European  countries. 
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Figure  3:  Spatial  distribution  of  land  use  features  from  OpenStreetMap  in 
Europe. 


mapped  country,  which  is  not  comparable  with  the  Netherlands,  holding  an 
area  of  41,500  km2  and  nearly  17  million  inhabitants,  corresponding  to  one  the 
best  mapped  countries  (82%).  Likewise,  while  the  completeness  index  for  Swe- 
den is  reported  as  almost  13%,  almost  more  than  half  of  this  country  is  covered 
by  forests.  This  justifies  the  low  completeness  index  value  as  minor  residents 
live  there  or  mappers  do  not  prioritize  mapping  forests.  This  heterogeneity  and 
inequality  of  public  participation  should  be  further  investigated  as  outlined  in 
(Jokar  Arsanjani  & Bakillah,  2015). 

Local  (metropolitan)  scale 

The  degree  of  completeness  at  local  level  i.e.  metropolitan  area  in  several  coun- 
tries was  checked  and  a wide  range  of  values  from  39%  for  Frankfurt  to  100% 
for  Bucharest  was  achieved.  These  values  are  shown  in  Figure  4. 


Thematic  accuracy 

Apart  from  completeness,  thematic  accuracy  is  a key  criterion  to  judge  about 
the  quality  of  the  contributed  LU  features.  This  is  meant  to  explore  how  prop- 
erly the  land  parcels  are  tagged.  Thematic  accuracy  is  basically  called  accuracy 
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Completeness  index  % 


Figure  4:  Completeness  index  of  OpenStreetMap  land  use  features  for  ten  large 
metropolitan  areas. 


assessment’  in  the  LU/LC  classification  studies,  which  reflects  the  difference 
between  a target  dataset  against  a reference  dataset  (Congalton,  1991;  G.  M. 
Foody  et  al.,  2013;  Giles  M Foody,  2002).  This  is  carried  out  through  summa- 
rizing all  data  in  a confusion  matrix  (i.e.,  error  matrix)  and  calculating  several 
indicators  including  overall/per  class  accuracies’,  ‘Kappa  index  of  agreement’, 
‘user’s  accuracy’  and  ‘producer’s  accuracy’”  (Giles  M Foody,  2002;  Herold  et  al., 
2008).  In  this  study,  a confusion  matrix  analysis  is  applied  to  reach  these  meas- 
ures. A measure  for  the  overall  accuracy  is  calculated  by  dividing  the  number  of 
identical  pixels  by  the  total  number  of  pixels.  However,  it  does  not  identify  how 
well  individual  classes  between  the  two  datasets  match.  Hence,  the  user’s  accu- 
racy and  producer’s  accuracy  should  be  calculated  to  measure  the  accuracy  of 
each  class  (Herold  et  al.  2008).  The  user’s  accuracy  indicates  the  probability  that 
a pixel  from  the  OSM  LU  map  actually  matches  the  GMESUA  dataset,  while  the 
producer’s  accuracy  refers  to  the  probability  that  a specific  LU  type  from  the 
reference  dataset  is  classified  as  such.  These  two  measurements  are  not  neces- 
sarily equal.  For  instance,  if  for  a specific  land  type  of ‘farming’,  with  accuracies 
achieved  of  75%  and  82%  for  user’s  accuracy  and  producer’s  accuracy  respec- 
tively, it  implies  that  as  a user  of  the  data,  roughly  75%  of  all  the  pixels  classified 
as  ‘farming’  are  the  same  in  the  reference  dataset  and,  as  a producer,  only  82% 
of  all  ‘farming’  pixels  are  classified  as  such  (Jokar  Arsanjani  et  al.,  2015). 

In  order  to  assess  how  well  LU  types  in  each  city  are  mapped,  Kappa  index, 
overall  accuracy,  user’s  accuracy,  and  producer’s  accuracy,  are  calculated.  Due  to 
heterogeneous  accuracies  across  cities,  interpretation  of  the  confusion  matrix  is 
discussed  for  each  city  separately  in  (Jokar  Arsanjani,  Mooney,  Zipf,  et  al.,  2015; 
Jokar  Arsanjani  & Vaz,  2015).  Further  to  this,  the  geographical  distribution  of 
agreements  and  disagreements  is  visualized  in  (Jokar  Arsanjani,  Mooney,  Zipf, 
et  al.,  2015).  In  general,  land  classes  such  as  Isolated  structures  [113],  Industrial, 
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commercial,  public,  military  and  private  units  [121],  Road  and  rail  networks 
and  associated  land  [122],  Sport  and  leisure  facilities  [142],  Agricultural+semi- 
natural+wetlands  [200],  Forests  [300],  and  Water  [500]  show  the  high- 
est level  of  agreement  in  the  two  datasets.  In  contrast,  the  remaining  classes 
show  disagreement,  assuming  that  they  are  correctly  reflected  in  the  reference 
(GMESUA)  dataset.  This  brings  up  the  question  whether  OSM  represents  the 
right  classification  or  the  reference  dataset.  Finally,  it  can  be  concluded  that  the 
contributed  OSM-LU  features  are  heterogeneously  distributed  over  inside/out- 
side  urban  areas,  which  confirms  the  availability  of  LU  features  in  both  urban 
and  rural  areas. 


Conclusions  and  recommendations 

The  recent  emergence  and  rapid  evolution  of  VGI  platforms,  such  as  OSM,  has 
involved  a massive  number  of  citizens  to  collect  and  share  geolocated  infor- 
mation and  attributes  about  geographical  objects.  This  bottom-up  process  of 
collecting  individuals  contributions  has  resulted  in  shaping  big  (geo) data, 
which  has  leveraged  new  applications  such  as  indoor  mapping  (Goetz  & Zipf, 
2010),  routing  applications  (Bakillah  et  al.,  2014),  tourism  recommendations 
(Sun,  Fan,  Bakillah,  & Zipf,  2013),  and  environmental  monitoring  (Fritz  et  al., 
2012;  Jokar  Arsanjani  & Vaz,  2015).  Although  the  question  on  how  to  attract 
users  and  how  to  keep  them  active  in  the  crowdsourcing  activities  is  yet  to 
be  addressed,  OSM  has  shown  its  continuing  success  in  attracting  more  than 
2.7  million  users.  Thus,  a considerable  potential  in  OSM  exists  and  is  yet  to  be 
further  explored.  Thus,  in  this  study,  we  comparatively  evaluated  the  complete- 
ness aspect  of  the  contributed  OSM-LU  features  across  Europe  as  well  as  their 
thematic  accuracy  in  ten  large  metropolitan  areas  to  find  out  how  reliable  we 
could  start  exploiting  them. 

Results  show  that  from  a thematic  accuracy  perspective,  the  thematic  quality 
of  OSM  features  range  from  moderate  to  substantial’  rank  of  Kappa  indices 
and  overall  accuracies.  Per-class  analysis  of  the  LU  types  shows  that,  depend- 
ing on  the  city,  Isolated  structures  [113],  Industrial,  commercial,  public,  mili- 
tary and  private  units  [121],  Road  and  rail  networks  and  associated  land  [122], 
Sport  and  leisure  facilities  [142],  Agricultural-i- semi- natural-L wetlands  [200], 
Forests  [300]  and  Water  [500]  reach  the  substantial’  rank  of  accuracies,  which 
means  that  these  classes  are  highly  useable.  It  should  be  noted  that  integrating 
ground-truth  information  with  other  reference  data  for  accuracy  assessment 
could  be  an  alternative  approach  for  producing  hybrid  LU  datasets. 

From  a temporal  accuracy  perspective,  archived  images  from  within  2005- 
2010  have  been  used  for  LU  mapping  and  this  could  have  caused  the  above- 
mentioned  disagreements,  whereas  the  OSM-LU  features  have  mainly  been 
uploaded  within  since  2009,  and  therefore,  some  information  from  OSM  might 
be  even  more  close  to  reality  than  our  reference  data.  Moreover,  the  MMU  of 
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the  GMESUA  datasets  is  0.25-1  ha  and,  therefore,  land  parcels  smaller  than 
this  MMU  are  ignored  in  the  course  of  mapping,  while  in  OSM  even  smaller 
parcels  are  mapped,  i.e.  a smaller  MMU  in  OSM  is  possible.  This  means  that  in 
some  parts  while  a polygon  in  GMESUA  dataset  is  representing  a specific  LU 
type,  the  same  area  in  OSM-LU  dataset  is  covered  by  multiple  small  polygons 
showing  multiple  land  types. 

Concerning  the  volunteers  recognition  of  LU  features,  the  citizens’  percep- 
tion of  LU  types  should  be  further  investigated  to  understand  the  way  they 
visually  interpret  LU  types  from  the  online  image  libraries  in  OSM.  As  a final 
conclusion,  the  OSM-LU  features  message  a promising  data  source  for  updat- 
ing LU  inventories.  Certainly,  the  longer  OSM  exists,  the  more  contributions 
will  be  received  and  consequently  higher  data  quality  can  be  achieved. 

This  study  points  out  some  other  recommendations  to  the  LU  researchers, 
environmental  scientists,  policy  makers,  among  others  that  will  lead  future 
research  possibly  in  more  suitable  directions.  Based  on  the  presented  com- 
pleteness indices  across  Europe,  as  well  as  the  accuracy  values  of  the  selected 
cities,  the  contributed  OSM-LU  features  account  for  a potential  alternative  data 
source  for  mapping  LU.  Further  studies  on  other  areas  must  be  conducted  to 
explore  the  heterogeneity  of  completeness  and  thematic  accuracy  across  space. 
Furthermore,  applying  data  mining  techniques  and  data  fusion  with  national 
and  regional  datasets  (e.g.,  GMESUA)  for  extracting  the  LU  information  of 
unmapped  areas  are  of  high  importance.  Additionally,  the  land  types  with  the 
highest  reliability  can  be  separately  incorporated  into  respective  applications. 
This  enables  experts  to:  (a)  possibly  find  ways  to  draw  the  attention  of  volunteer 
mappers  to  mapping  LU  features  by  highlighting  their  importance  for  more 
effective  environmental  monitoring,  (b)  possibly  improve  the  OSM  ontology 
of  the  LU  dataset,  (c)  maximize  the  efficiency  of  OSM  for  LU  mapping  as  users 
are  not  able  to  add  further  features  in  the  urban  areas,  because  the  massive  vol- 
ume of  mapped  objects  (e.g.  POIs,  roads,  building,  etc.)  do  not  let  users  to  have 
enough  space  for  adding  LU  features. 
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Abstract 

During  the  last  decade,  Crowdsourced  Geographic  Information  or  Volunteered 
Geographic  Information  has  attracted  the  attention  of  the  research  community 
to  explore  this  vast  amount  of  data  and  extract  useful  information  for  various 
applications.  Among  these,  geotagged  photos  shared  publicly  online  have  been 
explored  as  potential  source  for  Land  Use/Cover  (LULC)  mapping  creation  and 
validation.  In  this  work,  we  performed  an  evaluation  of  the  adequacy  of  the 
geotagged  photos  available  from  the  Panoramio  initiative  for  monitoring  LULC 
in  urban  areas,  with  a study  conducted  in  the  urban  area  of  Rome,  Italy.  We 
investigated  the  temporal  distribution  of  the  photos  for  the  time  range  2007- 
2013  with  different  resolution  (year,  season  and  month)  as  well  as  the  spatial 
distribution  in  the  study  area.  Then,  we  evaluated  the  representativeness  of  the 
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Panoramio  dataset  for  each  LULC  class  by  using  the  Urban  Atlas  database  as  a 
reference  and  computing  the  number  and  density  of  photos  per  square  km  for 
each  class.  Finally  we  discussed  the  main  limitations  of  the  dataset  for  LULC 
monitoring  in  urban  areas  and  we  proposed  alternative  approaches  useful  to 
overcome  some  of  the  identified  limitations. 


Keywords 

Geotagged  photographs,  Panoramio,  Volunteered  geographic  information. 


Introduction 

Land  Use/Cover  Change  (LULCC)  is  one  of  the  most  relevant  phenomena 
caused  by  humans  and  linked  with  global  environmental  and  climate  change. 
LULCC  trends  are  characterized  by  loss  of  natural  areas  and  by  the  expansion 
of  urbanization  due  to  population  increase.  Within  urban  areas,  landscape  and 
soil  functions  are  threatened  by  the  expansion  of  artificial  surfaces  and  there- 
fore mapping  and  monitoring  LULCC  is  crucial  to  support  proper  planning 
decisions.  Nevertheless,  the  creation  and  update  of  geographic  information 
occur  with  low  frequency  being  realized  institutionally  by  mapping  agencies 
and  being  particularly  expensive  (Goodchild  2008). 

The  geographic  information  created  and  shared  by  the  crowd  has  been 
increasing  significantly  over  the  last  decade  and,  today,  might  be  a potential 
alternative,  to  some  extent,  to  the  official  map-making.  This  Crowdsourced 
Geographic  Information  phenomenon,  also  called  Neogeography  (Tuner 
2006),  Volunteered  Geographic  Information  (Goodchild  2007),  and  more 
recently  Ambient  Geographic  Information  (Stefanidis,  Crooks  & Radzikowski 
2011),  have  attracted  the  attention  of  the  research  community  to  explore  this 
vast  amount  of  data  and  extract  useful  information  for  various  applications 
(e.g.  Arsanjani  et  al.  2013).  In  particular,  geotagged  photos  have  been  explored 
for  Land  Use/Cover  (LULC)  applications  (Estima  & Painho  2013;  Estima  & 
Painho  2014;  Lupia,  Estima  & Painho  2015)  at  different  geographic  scales 
showing  their  potential  for  this  kind  of  application,  despite  some  issues  related 
with  this  type  of  data  such  as  the  positional  accuracy  or  the  content  of  photos. 

The  main  contribution  of  this  paper  is  to  analyse  the  potential  use  of  geotagged 
photos  in  monitoring  LULC  in  urban  areas  by  discussing  also  the  main  limita- 
tions and  suggesting  improvements  and  possible  solutions  to  overcome  them. 
We  focused  on  the  Panoramio  initiative  with  a case  study  in  the  urban  area  of 
Rome,  Italy  (Figure  1).  We  explored  the  temporal  and  spatial  characteristics  of 
the  Panoramio  dataset  by  assessing  the  representativeness  of  the  photos  in  each 
LULC  class  and  using  the  last  version  of  the  Urban  Atlas  database  (EE A 2012) 
as  a reference. 
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Methodology 

Datasets  and  study  area 

We  performed  our  analysis  in  the  inner  area  of  Rome  (Italy)  covering  343  km2 
and  delimited  by  the  Grande  Raccordo  Anulare  ( GRA ),  the  highway  encircling 
the  urban  area.  The  city  has  experienced,  during  the  last  fifty  years,  relevant 
land  use  changes  with  phenomena  such  as  soil  sealing  and  urban  sprawl  that 
have  modelled  the  actual  spatial  structure. 

The  Urban  Atlas  (UA)  2006  dataset  was  used  as  a reference  for  LULC  data. 
The  UA  was  produced  in  2009  through  the  Global  Monitoring  for  Environment 
and  Security  (GMES)  program.  The  LULC  classes  are  based  on  the  Corine  Land 
Cover  with  a detailed  characterization  of  the  artificial  classes  by  providing  the 
degree  of  sealing  in  percentage  for  some  subclasses  (Sealing  Layer).  The  geomet- 
ric resolution  is  1:10,000  with  a minimum  mapping  unit  of  0.25  ha  (EEA  2012). 


Figure  1:  The  study  area:  the  urban  area  of  Rome  delimited  by  the  round 
shaped  highway  Grande  Raccordo  Anulare  (in  light  red).  Points  depict  the 
spatial  distribution  of  the  Panoramio  geotagged  photos  extracted  for  the  time 
range  2007-2013. 
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Urban  Atlas  class 

Area  (km2) 

% 

Continuous  Urban  Fabric 

36.42 

10.61% 

Discontinuous  Dense  Urban  Fabric 

41.74 

12.16% 

Discontinuous  Medium  Density  Urban  Fabric 

15.24 

4.44% 

Discontinuous  Low  Density  Urban  Fabric 

8.55 

2.49% 

Discontinuous  Very  Low  Density  Urban  Fabric 

1.84 

0.54% 

Isolated  Structures 

0.88 

0.26% 

Industrial,  commercial,  public,  military  and  private  units 

57.47 

16.74% 

Fast  transit  roads  and  associated  land 

1.94 

0.57% 

Other  roads  and  associated  land 

32.00 

9.32% 

Railways  and  associated  land 

4.41 

1.28% 

Airports 

1.00 

0.29% 

Mineral  extraction  and  dump  sites 

0.64 

0.19% 

Construction  sites 

5.60 

1.63% 

Land  without  current  use 

3.40 

0.99% 

Green  urban  areas 

28.53 

8.31% 

Sports  and  leisure  facilities 

13.43 

3.91% 

Total  Artificial  surfaces 

253.09 

73.72% 

Agricultural  + Semi-natural  areas  + Wetlands 

79.79 

23.24% 

Forests 

7.51 

2.19% 

Water  bodies 

2.91 

0.85% 

Grand  total 

343.30 

100.00% 

Table  1:  Area  and  percentage  over  the  total  of  the  Urban  Atlas  classes  in  the 
study  area. 

The  study  area  (see  Table  1)  is  covered  by  Artificial  surfaces  for  almost  two- 
thirds  (73.72%),  followed  by  Agricultural  + Semi-natural  areas  + Wetlands 
(23.24%),  Forests  (2.19%)  and  Water  bodies  (0.85%). 

The  dataset  containing  all  the  publicly  available  geotagged  photos  from  the 
Panoramio  initiative  was  created  by  downloading  the  metadata  for  all  the  avail- 
able photos  within  the  study  area.  This  task  was  performed  by  using  a script  to 
contact  the  Panoramio  servers  through  their  public  Application  Programming 
Interface  (API)  and  collect  the  available  metadata  for  each  photo  (e.g.  latitude, 
longitude,  photo  ID,  user  ID,  upload  date,  etc.).  The  resulting  dataset  was  com- 
posed by  a total  of  26,908  georeferenced  photographs  for  the  time  interval  16 
October  2005  - 11  August  2014.  The  dataset  was  then  converted  to  a GIS  for- 
mat, a point  shapefile  in  this  case,  using  the  latitude  and  longitude  attributes  of 
each  photo. 
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As  the  year  of  2014  was  not  complete  and  there  was  a very  small  number 
of  photos  during  the  first  years  of  the  initiative  (183  total  photos  for  2005  and 
2006)  and  this  could  bias  the  results  in  terms  of  temporal  analysis,  we  decided 
to  remove  them  from  the  final  collection  of  photos.  Therefore,  a subset  contain- 
ing 24,367  photos  for  the  period  2007-2013  was  extracted  and  used  for  the  sub- 
sequent analysis.  The  subset  excluded  also  the  1,035  pictures  inside  the  Vatican 
City  State  for  which  LULC  data  from  UA  were  not  available. 


Data  analysis 

To  assess  the  potential  of  the  Panoramio  dataset  for  monitoring  LULC  in  the 
urban  area  of  Rome,  the  following  method  was  used: 

1)  Analysis  of  the  temporal  distribution  of  the  photos.  We  used  the  “upload 
date”  tag  of  the  Panoramio  dataset  to  understand  the  temporal  distribu- 
tion within  the  study  area  by  using  three  resolution:  month,  season  and 
year.  Monthly  and  seasonal  temporal  distributions  were  evaluated  by 
using  the  average  number  of  photos  for  the  time  range  2007-2013. 

2)  Analysis  of  the  spatial  distribution  of  the  photos  within  the  study  area.  We 
observed  the  spatial  distribution  of  photos  within  the  study  area  to  verify 
uniformity  or  clustering  both  through  visual  inspection  and  by  comput- 
ing number  and  density  of  photos  (number  of  photos  per  km2)  for  some 
spatial  units. 

3)  Analysis  of  the  spatial  distribution  of  the  photos  within  each  LULC  class. 
We  computed  the  number  and  density  of  photos  for  each  UA  class  to 
assess  the  degree  of  coverage  for  every  LULC  class  inside  the  study  area. 


Results  and  discussion 
Temporal  distribution 

Results  by  year  show  an  increase  of  the  number  of  photos,  after  the  start  of 
the  initiative,  with  maximum  values  in  2011  (4,144)  and  2012  (4,379)  and  a 
yearly  average  of  3,481  for  the  period  2007-2013  (Figure  2-a).  The  distribution 
of  the  average  number  of  photos  by  month  has  the  highest  values  in  February, 
October  and  November  and  a minimum  in  September  (Figure  2-c).  By  observ- 
ing the  average  distribution  per  season  the  majority  of  photos  are  uploaded  in 
winter,  on  the  contrary  the  lowest  values  are  in  summer  (Figure  2-b).  A pos- 
sible explanation  to  this  temporal  trend  could  be  that  a large  part  of  photos  are 
taken  by  tourists  from  other  countries  during  their  summer  vacation,  while  the 
uploading  phase  is  postponed  to  the  winter  time  because  they  don’t  have  high 
speed  internet  connection  to  share  the  photos  immediately. 
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Figure  2:  Number  of  photos  per  year  with  lines  of  the  average  and  the  linear  increasing  trend  for  the  period  2007-2013  (a);  seasonal 
(b)  and  monthly  (c)  average  of  photos  for  the  period  2007-2013. 
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Spatial  distribution  within  the  study  area 

A visual  analysis  of  the  spatial  distribution  of  the  photos  show  a strong  concen- 
tration in  the  urban  centre  where  the  main  tourist  attractions  are  located,  while 
moving  outward  the  concentration  decrease  strongly  (Figure  1).  Over  the  whole 
study  area  the  average  density  is  71  photos/km2.  However,  this  value  changes 
abruptly  across  the  study  area  where  photos  create  clusters  of  different  size  and 
shape.  Photos  can  be  concentrated  along  linear  features,  for  example,  the  cluster 
along  the  South-East  direction  (Figure  3)  is  centered  on  the  famous  ancient 
road  Via  Appia  Antica  (633  photos/km2  inside  a 50  meters  buffer  around  the 
centreline  of  the  road).  Another  example  is  the  Vatican  area.  Although  Vatican 
is  not  considered  for  this  study  we  calculated  the  density  of  photos  to  under- 
stand the  impact  of  tourist  attractions  to  the  availability  of  data;  as  expected, 
this  small  area  (0.53  km2)  has  an  extremely  high  density  (1,957  photos/km2  ca.). 


Spatial  distribution  over  UA  classes 

In  terms  of  number  of  photos  the  majority  is  concentrated  inside  Artificial 
surfaces  (22,713  photos,  representing  93.27%),  followed  by  Agricultural  + 


Figure  3:  Concentration  of  the  Panoramio  photos  along  the  famous  ancient 
road  Via  Appia  Antica.  The  number  and  the  density  of  photos  were  computed 
for  the  area  delimited  with  a 50  m wide  buffer  along  the  centreline  of  the  road. 


292  European  Handbook  of  Crowdsourced  Geographic  Information 


Semi-natural  areas  + Wetlands  (1,007  photos,  representing  4.14%),  Water  bod- 
ies (598  photos,  representing  2.46%)  and  Forests  (35  photos,  representing 
0.14%).  Two-thirds  of  the  photos  belonging  to  the  Artificial  surfaces  are  dis- 
tributed in  the  following  subclasses:  Industrial,  commercial,  public,  military 
and  private  units  (6,862  photos,  representing  28.18%),  Other  roads  and  associ- 
ated land  (5,401  photos,  representing  22.18%)  and  Continuous  Urban  Fabric 
(4,324  photos,  representing  17.76%),  see  Figure  4-b. 


Figure  4:  Density  (number  of  photos  per  km2)  (a);  number  of  photos  over  the 
total  (in  percentage)  for  each  Urban  Atlas  class  (b). 
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In  terms  of  density  (number  of  photos  per  km2),  Water  bodies  have  the  highest 
density  (205.29),  followed  by  the  Artificial  surfaces  (89.74),  Agricultural  + Semi- 
natural areas  + Wetlands  (12.62)  and  Forests  (4.66),  see  Figuere  4-a.  Within  the 
Artificial  surfaces  the  following  subclasses  have  the  highest  values  of  density: 
Other  roads  and  associated  land  (168.8),  Industrial,  commercial,  public,  mili- 
tary and  private  units  (119.4),  Continuous  Urban  Fabric  (118.7),  Railways  and 
associated  land  (104.8)  and  Green  urban  areas  (95.3).  The  density  and  the  num- 
ber of  the  photos  within  the  UA  classes  confirm  a strong  unevenness  with  a pre- 
dominance of  potential  information  in  the  Artificial  surfaces  and,  surprisingly, 
in  the  Water  bodies,  which  correspond  to  the  Tiber  River.  The  latter  result  can  be 
explained  with  the  large  number  of  photos  (598)  spread  over  a very  small  surface 
of  the  study  area  (2.91  km2,  0.85%).  Tiber  River  is  an  important  landmark  with 
several  relevant  tourist  attractions  along  its  banks,  but  also  a place  monitored  for 
environmental  aspects.  In  fact,  76  out  of  598  (12.71%)  photos  were  published  by 
a public  authority  during  field  observations  during  the  period  2007-2013  and 
190  out  of  736  (25.82%)  during  the  period  16/10/2005  - 11/08/2014. 

Conclusions 

In  this  paper,  we  analysed  the  potential  of  geotagged  photos  from  the  Panora- 
mio  initiative  as  a source  of  information  for  LULC  monitoring  in  urban  areas, 
with  a case  study  in  the  city  of  Rome. 

Similarly  to  what  has  been  reported  in  Estima  and  Painho  (2013,  2014),  the 
most  positive  aspects  of  this  dataset  are  the  amount  of  available  photos  and 
their  temporal  distribution.  On  the  opposite  side,  this  dataset  showed  some 
limitations  for  urban  land  use  monitoring  analysis.  Some  LULC  classes  have 
a better  coverage  of  photos  compared  to  others,  generally,  artificial  areas  and 
areas  where  important  landmarks  and  tourist  attractions  are  located.  The 
potential  use  of  photos  may  be  not  homogeneous  among  different  urban  areas, 
with  famous  touristic  places  having  usually  more  photos  than  urban  areas  that 
do  not  have  any  famous  landmarks.  Uneven  temporal  distribution  might  be 
also  found  in  some  places  and  LULC  classes  as  special  events  attracting  a high 
number  of  people  occur  in  particular  dates.  The  metadata  downloaded  from 
Panoramio  include  the  date  when  photos  have  been  uploaded  that  in  most 
cases  is  not  the  same  date  when  they  were  actually  taken.  This  issue  bias  the 
temporal  characteristic  of  photos  and  affect  their  reliability  if  one  needs  to  con- 
sider the  temporal  aspect.  Finally,  the  actual  content  of  photos  that  in  some 
cases  do  not  show  a subject  related  to  LULC. 

There  are  few  solutions  to  address  some  of  these  issues.  Downloading  addi- 
tional metadata  from  the  initiative,  such  as  the  Exif  information,  not  available 
currently  through  the  Panoramio  public  API,  would  solve  the  date  mismatch 
once  it  integrates  the  date  when  the  photo  was  taken  and  add,  in  some  cases, 
even  more  information  (e.g.  the  zoom  level).  Also  the  integration  of  photos 
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from  other  available  and  similar  initiatives  such  as  Flickr,  Instagram,  among 
others,  could  increase  the  reliability  of  this  type  of  data  in  some  aspects. 
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Abstract 

This  chapter  describes  AtrapaelTigre.com , a citizen  science  project  focusing  on 
the  Asian  tiger  mosquito  in  Spain.  Commonly  known  for  its  aggressive  biting 
during  the  day,  the  tiger  mosquito  represents  a global  environmental  problem. 
It  is  an  invasive  species  and  a vector  for  dengue,  chikungunya  and  other  dis- 
eases, making  it  a serious  public  health  risk.  It  is  also  an  everyday  nuisance  and 
a threat  to  tourism  and  related  industries.  The  management  of  invasive  species, 
and  particularly  disease  vectors,  requires  integrated  programs  that  combine 
public  communication  and  education  with  research,  surveillance  and  control. 
AtrapaelTigre.com  aims  at  achieving  this  by  engaging  citizen  scientists  to  raise 
awareness  and  collect  data  on  tiger  mosquito  adults  and  their  breeding  sites 
with  a smartphone  app  ( Tigatrapp ) and  a multi-proxy  data  validation  system 
that  combines  expert,  crowd,  and  app-user  input.  Lessons  learned  during  the 
first  year  of  implementation  in  Spain,  in  2014,  have  guided  our  current  strate- 
gies with  respect  to  both  tiger  mosquitoes  and  the  formal  integration  of  citizen 
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science  into  the  research,  surveillance  and  control  of  invasive  species  and  dis- 
ease vectors  generally.  We  address  the  challenges  of  implementing  such  frame- 
works and  discuss  their  fitness  for  use  in  public  health  systems.  The  goal  of 
AtrapaelTigre.com  is  not  only  to  enhance  participation  and  raise  awareness,  but 
also  to  promote  novel  research  and  a more  informed  and  cost-effective  man- 
agement of  the  tiger  mosquito  across  Spain. 


Keywords 

citizen  science,  volunteered  geographic  information,  invasive  species,  disease 
vectors,  Asian  tiger  mosquito 


Introduction 

AtrapaelTigre.com  (‘Catch-the-Tiger’)  is  a citizen  science  project  with  a Vol- 
unteered Geographic  Information  (VGI)  (Goodchild  2007)  component  that 
enlists  ordinary  people  in  the  research,  surveillance  and  control  of  Asian  tiger 
mosquitoes  (Aedes  albopictus)  (Skuse  1894)  in  Spain.  The  tiger  mosquito  is  an 
invasive  species  from  Southeast  Asia  that  has  spread  worldwide  and  become 
common  in  developed  landscapes  (Hawley  1988).  It  is  well  known  for  biting 
aggressively  during  the  day,  and  importantly,  it  is  a vector  of  such  diseases  as 
dengue  and  chikungunya  (Paupy  et  al.  2009).  The  species  is  relatively  easy  to 
recognize  from  its  behavior  and  appearance,  and  it  was  first  detected  in  Spain 
in  2004  (Aranda  et  al.  2006).  The  tiger  mosquito  is  now  established  along  the 
Spanish  Mediterranean  coast  (Alarcon-Elbal  et  al.  2014),  where  it  threatens 
public  health  and  degrades  the  quality  of  life,  while  also  harming  the  tourism 
sector  (Roiz  et  al.  2007),  which  peaks  in  the  summer,  just  when  the  species  is 
most  active. 

The  efficacy  of  public  management  programs  is  limited  because  tiger  mos- 
quitoes breed  not  only  in  public  spaces,  but  also  in  small  water  containers  in 
private  areas,  such  as  the  plates  people  place  under  their  flower  pots  on  balco- 
nies and  patios.  Removing  the  water  from  these  containers  prevents  the  devel- 
opment of  larvae  and  may  significantly  reduce  the  presence  of  adults  at  a given 
place.  This  can  be  especially  effective,  since  the  lifetime  dispersal  range  of  the 
species  is  only  around  600  m (Hawley  1988).  Therefore,  awareness-raising  and 
education  campaigns  are  key  tools  for  control  programs.  This,  combined  with 
the  ease  with  which  tiger  mosquitoes  can  be  identified  and  the  extent  to  which 
they  are  often  well  known  to  the  communities  in  which  they  are  prevalent, 
makes  the  species  a good  target  for  citizen  science.  A number  of  projects  have 
recently  begun  using  public  participation  to  monitor  mosquitoes  elsewhere 
(e.g.  Miickenatlas  in  Germany,  iMoustique  in  France)  (Kampen  et  al.  2015). 
Although  invasive  species  are  common  citizen  science  targets  (e.g.  Dickinson 
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et  al.  2010),  it  is  much  less  common  to  specifically  target  disease  vectors  like  the 
tiger  mosquito  that  affect  humans. 

AtrapaelTigre.com  has  two  specific  objectives:  (1)  to  explore  new  methodolo- 
gies for  acquiring  data  for  tiger  mosquito  research,  surveillance  and  control 
through  public  participation,  and  (2)  to  raise  public  awareness  and  promote 
household  control  actions.  The  project  was  initiated  in  2013  as  a pilot  in  a small 
region  of  Spain  with  a limited  target  group  of  participants.  The  know-how  and 
network  acquired  served  to  extend  the  approach  in  2014  to  the  whole  terri- 
tory of  Spain.  Here  we  present  the  existing  system  (data  collection,  validation 
and  visualization)  and  the  lessons  learned  in  2014,  during  the  first  year  of 
implementation  in  Spain.  We  also  placed  the  project  and  future  challenges  in  a 
broader  conceptual  framework  useful  for  other  citizen  science  projects  target- 
ing invasive  species  and  disease  vectors. 


Data  collection  and  validation  approach 

Data  collection  is  done  using  the  smartphone  app  Tigatrapp , available  on 
Google  Play  and  iTunes.  Tigatrapp  is  free  and  open  source  software,1  and  it 
may  be  redistributed  or  modified  under  the  GNU  General  Public  License  (ver- 
sion 3).  Participation  is  anonymous,  but  participants  must  consent  first  to  the 
privacy  policy  and  terms  of  use.  With  Tigatrapp , ordinary  people  can  collect 
and  send  geolocalized  reports  of  tiger  mosquitoes  and  their  breeding  sites 
(Figure  1).  To  do  this,  they  need  to  learn  how  to  identify  the  tiger  mosquito 
(including  basic  taxonomy  and  life  cycle),  and  this  information  is  provided  by 
the  app  and  the  project  website,  as  well  as  in  workshops  and  talks  organized 
throughout  the  mosquito  season.  Tigatrapp  reports  include  (Figure  1):  i)  loca- 
tion, obtained  by  the  app  directly  from  the  devices  GPS  receiver  or  its  network 
connections  or  from  the  user  selecting  the  location  on  a map,  ii)  key  taxonomic 
traits  of  the  reported  mosquito  or  characteristics  of  the  reported  breeding  site 
based  on  a small  survey  (i.e.  user  level  validation),  iii)  photographs  (compul- 
sory for  breeding  sites  but  optional  for  adult  mosquitoes,  which  are  often  hard 
to  photograph),  and  iv)  optional  complementary  notes. 

The  app  also  collects  5 randomly-timed,  anonymous  samples  of  user  loca- 
tions on  a daily  basis  (although  users  have  the  option  to  switch  this  feature  off). 
This  background  location  system  (Figure  1)  is  used  to  estimate  sampling  effort 
and  territorial  exposure,  making  it  possible  to  adjust  the  report  analysis  for 
the  fact  that  users  are  not  randomly  distributed  across  the  landscape.  In  other 
words,  the  background  location  system  helps  in  identifying  the  extent  to  which 
reports  from  a given  area  are  being  driven  by  the  density  of  users,  the  density  of 
mosquitoes,  or  both.  In  order  to  protect  privacy,  all  background  location  infor- 
mation is  masked  on  users’  devices  by  placing  locations  on  a predetermined 
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DATA  COLLECTION  APP  TIGATRAPP 

REPORTS 

MOSQUITOES 

BREEDING  SITES 

Location 

Taxonomic  survey 
Photographs  (optional) 

Notes  (optional) 

Location 

Sites  survey 
Photographs 

Notes  (optional) 

OTHER  DATA 

BACKGROUND  LOCATION 

MISSIONS 

Approximate  location  5 times/day 
(optional) 

Variable  information 
sent/collected  to/from  a specific 
region  (optional) 

Figure  1:  Data  collected  with  Tigatrapp.  Background  location  and  missions  are 
not  yet  implemented  in  the  iOS  version  of  the  app  due  to  financial  constraints. 


grid  of  0.05  degrees  latitude  and  longitude.  Only  the  grid  cell  identifier  is  trans- 
mitted from  the  device  to  the  server,  making  it  impossible  to  determine  the 
actual  location  within  the  cell.  In  addition,  these  locations  are  identified  only 
by  a code  that  is  randomly  assigned  on  the  users  device,  without  any  additional 
information  or  any  way  to  link  the  location  to  a given  users  reports. 

Finally,  the  app  incorporates  the  concept  of  missions  (Figure  1),  which  are 
extra  voluntary  activities,  notes  or  surveys  that  are  sent  as  an  incoming  notifica- 
tion to  Tigatrapp.  These  can  be  set  to  appear  on  the  devices  of  all  users  or  only 
those  in  a given  area.  Missions  allow  the  implementation  of  extra  activities  on 
the  go,  and  direct  communication  with  participants,  according  to  specific  needs. 

The  two  types  of  reports  (mosquito  adults  and  breeding  sites)  and  the  sam- 
pling effort  (covered  area)  can  be  visualized  on  a webmap  embedded  in  the  pro- 
ject website.  The  coverage  map  is  also  a useful  tool  for  increasing  user  engage- 
ment, as  people  are  able  to  see  the  extent  to  which  the  app  is  being  used  across 
Spain  and,  indeed,  throughout  the  world.  Raw  data  is  also  available  to  some 
scientists  (even  outside  the  project)  and  to  some  public  administrations  respon- 
sible for  tiger  mosquito  control  and  additional  data  will  be  made  available  to  the 
public  through  public-access  repositories  and  other  means  in  the  future. 

To  validate  reports,  a multi-proxy  system  that  combines  expert,  crowd  and 
app-user  validation  has  been  implemented  (Figure  2).  Expert  and  crowd  val- 
idation are  based  on  the  analysis  of  report  photographs  and  are  done  using 
two  different  on-line  platforms:  a)  a custom-built  platform  for  experts  and  b) 
Crowdcraffing.org  ( Tigafotos  project)  for  the  public.  App-user  validation  is 
based  on  the  user’s  responses  to  the  survey  contained  in  each  submitted  report. 
For  expert  validation,  the  experts  analyze  the  photographs  to  classify  each 
report  into  one  of  five  categories  (Figure  2)  based  on  the  assessed  probability 
of  its  being  accurate  (i.e.  the  probability  that  the  user  actually  observed  a tiger 
mosquito  or  a tiger  mosquito  breeding  site).  Expert  validation  is  done  before 
publishing  report  photographs  on  the  webmap  and  thus,  also  serves  to  filter 
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out  pictures  that  are  sensitive  (e.g.  bites  on  peoples  bodies),  include  personal 
information  (e.g.  whole  body  or  face  pictures),  or  are  unrelated  to  the  project 
(although  there  have  been  few  of  these).  Experts  can  also  decide  to  publish 
the  app-user  s note  (if  available)  or  to  add  their  own  note  in  the  report  pop- 
up on  the  webmap.  Expert  validation  results  are  used  as  the  main  filter  and 


No 

No  photograph  validation 
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Photograph  validation 

CROWD  VALIDATION  ( Crowdcrafting ) 

EXPERT  VALIDATION  (own  platform) 

Assigns  a category  based  on  the 
opinion  of  the  crowd 

Assigns  a category  based  on  the 
opinion  of  several  experts 

Report  classified 


PUBLIC 

WEBMAP 


Report  not  classified 


EXAMPLE:  CATEGORY  CLASSIFICATION  FOR  MOSQUITO  REPORTS 
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UNKNOWN 
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properly 
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Reports  with  a 
photograph  that 
looks  like  a 
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that  are  not 
informative 
enough  (e.g. 
blurry, 

overexposed) 
for  a better 
decision 

Reports  with  a 
photograph  of  a 
mosquito/similar 
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another 
species  of 
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Figure  2:  Diagram  of  the  multi-proxy  validation  system  (top),  and  categories 
used  for  the  validation  of  mosquito  reports  (bottom).  In  the  middle,  a map 
screenshot  and  a citizen  scientist  s picture  of  a tiger  mosquito. 
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classification  method  in  the  public  webmap.  However,  crowd  (if  available)  and 
app  user  validation  results  are  also  displayed  in  the  report  pop-ups.  Reports 
without  pictures  are  displayed  as  unclassified’. 


Lessons  from  the  first  year  of  implementation  (2014) 

During  the  2014  data  collection  period  (late  spring  to  late  winter  2014),  almost 
7,000  people  downloaded  Tigatrapp  and  registered  as  users  (Figure  3).  In  total, 
-2,900  reports  (including  mission  answers)  were  sent  from  -1,300  unique  user 
identifiers.  Both  app  download  and  data  collection  dynamics  were  strongly 
influenced  by  media  appearances.  Most  reports  (-60%)  were  of  mosquito 
sightings,  followed  by  mission  answers  (-30%)  and  breeding  sites  (-10%).  It  is 
not  clear  why  breeding  site  reports  have  been  so  much  less  frequent  than  adult 
mosquito  reports.  It  could  be  that  people  are  less  motivated  to  report  breeding 
sites  than  to  report  adult  mosquitoes  (which  may  have  just  bitten  them),  or 
that  the  complexity  of  the  breeding  site  concept  (e.g.  indirect  cause-effect  due 
to  mosquito  life  cycles)  or  the  requirement  that  breeding  site  reports  include 
photographs  makes  their  reporting  more  challenging.  Whatever  the  reason, 
the  low  number  of  breeding  site  reports  in  2014  led  us  to  focus  the  expert 
validation  for  that  year  on  only  the  adult  mosquito  reports.  In  contrast,  our 
strategy  for  2015  has  been  to  improve  breeding  site  reporting  by  working  with 
public  mosquito  management  agencies  to  increase  outreach  and  ensure  that 
breeding  site  reports  lead  to  tangible  results  in  urban  public  spaces  (see  next 
Section). 

Participation  in  crowd  validation  (‘ Tigafotos’  project  in  Crowdcrafting.org) 
was  low  (-300  validated  photographs  out  of  -1,200)  and  heterogeneous  in 
time.  Being  hosted  in  a separate  platform,  it  is  difficult  to  compare  participa- 
tion trends  to  Tigatrapp  participation.  On  one  hand,  hosting  a crowd  based 
photograph  validation  system  on  an  international  crowdsourcing  platform  has 
clear  benefits  in  terms  of  participation,  visibility  and  ease  of  implementation. 
On  the  other  hand,  this  approach  has  the  drawback  of  relying  on  participants 
who  are  disconnected  from  the  project  and  its  objectives.  Crowd  engage- 
ment in  the  Tigafotos  project  might  be  improved  by  embedding  it  also  in  the 
AtrapaelTigre.com  website  (Crowdcrafting.org  allows  for  that),  improving  the 
project’s  design,  making  it  multi-lingual,  and  adding  other  gamification  and 
engagement  elements. 

Although  photographs  are  the  basis  for  report  validation  (expert  and  crowd), 
photograph  attachments  are  optional  for  mosquito  reports.  This  allows  users  to 
report  mosquitoes  without  having  to  catch  them.  However,  it  may  be  that  this 
makes  it  too  easy  for  users  to  avoid  making  the  effort  of  taking  a picture,  even 
when  they  are  able  to.  In  2014,  only  around  30%  of  the  adult  mosquito  reports 
actually  included  a photograph  (Figure  3).  Of  those,  expert  validation  approxi- 
mately assigned  -40%  to  the  “Unknown”  category,  -50%  to  ‘Confirmed  Asian 
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tiger  mosquito’  or  ‘Possible  tiger  mosquito’,  and  -10%  to  ‘Possible  other  spe- 
cies” or  “Other  species’  (Figure  2).  Improving  users’  skill  in  taking  photographs 
of  mosquitoes  might  well  result  in  more  valuable  data  by  moving  reports  out 
of  the  ‘Unknown  category  We  are  now  using  Tigatrapp  missions  and  embed- 
ded information,  social  media  (Twitter  and  Facebook),  and  the  project  blog 
to  systematically  train  and  encourage  users  to  take  more  and  better  pictures. 
These  strategies  seem  to  be  improving  the  fitness  for  use  of  data  considerably, 
since  the  number  of  adult  reports  with  pictures  increased  to  60%  (30%  in  2014) 
and  the  number  of  reports  that  could  be  classified  as  “Confirmed  Asian  tiger 
mosquito”  or  “Possible  tiger  mosquito”  in  2015  was  -1,700,  almost  the  same 
as  the  total  number  of  mosquito  reports  (with  or  without  pictures)  received 
in  2014,  i.e.  -1,740  (Figure  3).  We  are  also  developing  quantitative  methods  to 
make  reports  without  photographs  more  useful  for  scientific  and  management 
purposes.  For  example,  by  combining  responses  to  the  taxonomic  survey  in 
each  report  with  knowledge  about  the  user  based  on  the  quality  and  quantity 
of  previous  reports,  we  may  better  assess  the  probability  that  a given  report 
corresponds  to  an  actual  tiger  mosquito.  Other  methods,  like  taxonomic  vali- 
dation of  georeferenced  mosquitoes  sent  by  post  (Kampen  et  al.  2015)  may  be 
explored  in  the  future,  and  the  ultimate  goal  will  be  to  compare  the  results  of 
several  independent  validation  methods,  along  with  semi-automated  and  intel- 
ligent algorithms  based  on  prior  knowledge. 


2014  RESULTS:  1st  YEAR  OF  PROJECT  IMPLEMENTATION  IN  SPAIN 
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Figure  3:  Summary  infographic  of  the  results  obtained  during  2014,  the  first 
year  of  project  implementation  in  Spain  (numbers  are  approximate). 
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Integrating  citizen  science  and  VGI  in  the  management  of 
invasive  species  and  disease  vectors 

Our  experience  with  AtrapaelTigre.com  has  made  it  clear  that  using  citizen  sci- 
ence to  target  invasive  species  or  disease  vectors  requires  at  least  three  basic 
domains  of  expertise:  i)  communication  and  education,  ii)  surveillance  and  con- 
trol and  iii)  research  (Figure  4).  These  domains  and  their  interrelations  acquire 
a whole  new  dimension  as  a consequence  of  the  citizen  scientists  involvement 
and  the  need  to  implement  control  measures  that  complement  current  environ- 
mental management  and  public  health  system  polices.  Here  we  discuss  how  to 
improve  the  project  in  each  of  these  domains,  and  the  main  challenges. 


Communication  and  education 

The  project  has  3 communication  and  education  objectives:  1)  spread  the  word 
to  gain  new  participants  and  wider  geographic  coverage,  2)  keep  the  interest 
of  participants  and  the  media,  and  3)  inform  participants  so  that  they  not  only 
provide  useful  data  but  also  take  control  actions  at  places  out  of  the  scope  of 
public  administrations  (e.g.  their  houses).  For  instance,  by  spotting  and  rec- 
ognizing breeding  sites,  citizens  become  aware  of  the  importance  of  water 
removal  in  their  backyards.  To  accomplish  these  objectives,  the  project  actively 
disseminates  information  through  a blog,  Facebook  and  Twitter  accounts, 
press  releases,  and  talks  and  workshops  for  different  audiences.  The  project  also 
collaborates  with  public  administrations  and  private  stakeholders  and  encour- 
ages these  entities  to  include  information  about  AtrapaelTigre.com  in  their  own 
outreach  campaigns  (e.g.  flyers,  websites,  media  appearances).  In  2015,  we  have 
coordinated  communication  actions  with  the  Barcelona  Public  Health  Agency 
(ASPB)  and  public  entities  in  Valencia  and  the  Canary  Islands,  amongst  others. 
We  are  not  explicitly  assessing  trends  in  public  awareness  or  the  population 
behavior  change  related  to  project  actions.  However,  several  indicators  dem- 
onstrate a good  performance  in  terms  of  communication:  media  appearances 
increased  each  year  (see  project  website  for  a full  list)  and  sessions  in  the  pro- 
ject website  have  more  than  quadrupled  between  2014  and  2015,  based  on  esti- 
mates from  Google  Analytics  website  data  between  July  and  December. 

From  a communication  perspective,  we  have  developed  a two-fold  strategy. 
We  offer  a regular  stream  of  new,  interesting,  and  in-depth  scientific  outreach 
material  related  to  the  targeted  species  and  to  other  mosquito  disease  vectors 
aspects.  At  the  same  time,  we  use  project  results  to  demonstrate  and  explain 
how  citizen  scientist  participation  can  help  to  improve  the  surveillance  and 
control  of  the  species  in  the  short  term.  This  is  a challenge,  but  of  high  impor- 
tance, since  most  people  are  likely  to  participate,  not  out  of  scientific  interest, 
but  because  they  are  affected  by  the  presence  of  the  species  and  have  a personal 
interest  in  its  local  eradication. 
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Figure  4:  Identified  expertise  domains  for  improved  predictive  power  and 
management  of  invasive  species  and  disease  vectors  under  a citizen  science 
framework. 


Surveillance  and  control 

Tiger  mosquito  surveillance  and  control  programs  in  Spain  currently  involve 
a traditional  integrative  management  approach  that  incorporates  communi- 
cation and  education.  The  citizen  science  framework  (Figure  4)  goes  beyond 
this  by  exploiting  new  technologies  (apps,  webmaps,  and  social  media)  that 
enable  massive  and  systematic  calls-to-action  while  making  the  resulting  data 
immediately  available  to  management  services  and  the  general  public.  It  has 
been  demonstrated  elsewhere  that  public  participation  (even  more  through  the 
use  of  new  technologies)  can  advance  the  detection  of  an  invasive  species  even 
2 years  before  traditional  monitoring  programs  (Scyphers  et  al.  2014). 

The  usefulness  of  participative  frameworks  as  early  warning  systems  (the 
primary  role  of  surveillance)  is  confirmed  in  the  case  of  AtrapaelTigre.com.  By 
enlisting  a large  team  of  citizen  scientists  while  also  engaging  with  the  network 
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of  management  agencies  and  other  tiger  mosquito  stakeholders  in  Spain,  we 
have  been  able  to  gather  critical  information  and  pass  it  to  the  actors  respon- 
sible for  surveillance  in  the  implicated  regions.  These  actors  are  then  able  to 
decide  whether  to  further  investigate  and  activate  relevant  environmental  and 
public  health  protocols.  For  example,  the  first-ever  report  of  tiger  mosquitoes 
in  Andalusia  came  from  a citizen  scientist  via  Tigatrapp  in  2014.  After  detect- 
ing this  as  a credible  alarm,  we  contacted  specialists  there,  who  corroborated 
the  presence  of  the  species  in  the  field  (Delacour-Estrella  et  al.  2014).  Similarly, 
citizen  scientists  using  Tigatrapp  were  also  the  first  to  detect  tiger  mosquitoes 
in  the  Catalan  pre-Pyrenees,  a discovery  confirmed  by  rural  agents  and  passed 
through  social  networks  to  raise  awareness  (pers.  comm.).  Citizen  science  sys- 
tems like  AtrapaelTigre.com  should  not  be  seen  as  substitutes  for  active  sur- 
veillance (e.g.  targeted  sampling  methods)  by  specialists.  The  detection  of  the 
species  in  the  Basque  Country  in  2014,  for  instance,  came  about  only  through 
active  surveillance  (Delacour-Estrella  et  al.  2015),  demonstrating  the  extent  to 
which  the  two  approaches  are  complimentary.  Indeed,  efficiency  of  combining 
passive  (e.g.  data  gathered  by  the  general  public)  and  active  surveillance  for 
mosquitoes  in  Europe  is  increasingly  apparent  (Kampen  et  al.  2015). 

Despite  their  importance,  early  warnings  are  only  part  of  the  story.  There 
are  many  regions  in  Spain  where  the  tiger  mosquito  has  already  become  estab- 
lished and  well  known.  The  challenge  for  AtrapaelTigre.com  in  these  regions  is 
to  build  up  a participatory  system  for  management  and  control,  reducing  the 
public  health  risk  and  improving  life  quality.  To  this  end,  we  have  contacted 
actors  and  stakeholders  responsible  for  control  programs  in  these  areas,  and 
developed  tools  (e.g.  interactive  web -interfaces)  to  make  citizen  science  data 
more  accessible  and  useful  for  them.  This  step  is  costly  but  it  has  made  the  pro- 
ject much  more  powerful,  as  the  data  from  citizen  scientists  can  be  immediately 
used  to  improve  management  and  control  in  affected  areas  where  this  joint  col- 
laboration is  established.  For  instance,  in  2015,  the  ASPB  incorporated  part  of 
its  team  directly  into  the  AtrapaelTigre.com  expert  validation  system,  and  it  is 
using  citizen  science  data  from  the  project  to  improve  tiger  mosquito  control  in 
the  city  of  Barcelona.  A similar  strategy  was  followed  by  stakeholders  in  the  city 
of  Valencia  in  the  same  year,  and  all  signs  are  that  this  type  of  involvement  is 
highly  beneficial.  The  numbers  of  breeding  sites’  reports  have  doubled  in  2015 
(although  the  total  number  is  still  much  smaller  than  for  mosquito  reports). 
In  Barcelona,  20%  of  -280  adult  and  breeding  sites’  reports  received  in  2015 
in  and  around  the  city,  were  considered  useful  by  the  ASPB  for  management 
purposes  and  were  incorporated  into  their  already  long-lasting  Public  space 
surveillance  and  control  program.  In  Valencia,  with  a more  recent  history  of  sur- 
veillance programs  starting  in  2014, 40%  of  the  detected  positive  breeding  sites 
in  the  city  were  thanks  to  citizen’s  reports  (breeding  sites  and  adults).  Indeed, 
without  discounting  the  importance  of  engagement  with  the  citizen  scientists 
themselves,  our  renewed  efforts  to  communicate  with  stakeholders  are  proving 
crucial  for  the  long-term  maintenance  of  the  whole  participatory  system. 
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Research 

Often  citizen  science  is  challenged  by  sampling  bias  (based,  for  example,  on 
the  distribution  of  users  across  territory,  or  the  involvement  of  restricted  social 
layers)  and  large  variations  in  the  quality  of  the  data  (Dickinson  et  al.  2010). 
However,  similar  problems  are  also  present  in  field  data  collected  by  profes- 
sional scientists  (e.g.  biodiversity  estimates,  density  estimates  of  a population). 
Mosquito  surveillance  in  many  European  countries  often  relies  on  ovitrap  net- 
works (networks  of  traps  in  which  females  are  prone  to  lay  their  eggs).  Such 
trap  networks  can  detect  the  presence  of  tiger  mosquitoes  but  do  not  provide 
good  estimates  of  abundance,  can  generate  false  negatives,  can  be  biased  by 
placement  and  exposure  time,  and  are  generally  limited  to  recently  colonized 
or  highly  populated  areas,  leaving  large  gaps  in  territorial  coverage.  An  impor- 
tant challenge  for  AtrapaelTigre.com  (and,  by  extension,  to  other  citizen  science 
projects)  is  to  demonstrate  how  the  combination  of  citizen  scientist  data  and 
ovitrap  surveillance  (Kampen  et  al.  2015)  can  improve  predictions  of  the  dis- 
tribution (current  and  potential),  risk  factors,  and  spreading  dynamics  of  the 
targeted  species. 

The  key  is  to  let  the  strengths  of  each  approach  compensate  for  the  others 
weaknesses  and  to  use  reliable  results  from  each  as  a means  of  cross-calibra- 
tion. For  example,  there  is  now  a large  amount  of  ovitrap  data  available  and 
Tigatrapp  data  covering  the  same  areas  of  Spain,  as  well  as  Tigatrapp  data  for 
areas  and  times  for  which  ovitrap  data  is  lacking.  We  are  using  the  ovitrap  data 
in  the  overlapping  areas  to  calibrate  models  built  from  the  Tigatrapp  data,  and 
we  are  then  using  these  calibrated  models  to  make  estimates  about  the  areas 
and  times  for  which  the  ovitrap  data  is  absent.  We  expect  such  novel  model- 
ling approaches  to  produce  more  robust  conclusions  that  contribute  to  cost- 
effective  management  strategies. 


Conclusions 

Once  the  infrastructure  and  basic  implementation  of  a citizen  science  project 
has  been  put  in  place,  attention  turns  to  sustaining  long  term  participation  and 
obtaining  sound  scientific  conclusions  from  the  volunteered  data.  For  projects 
like  AtrapaelTigre.com , that  focus  on  invasive  species  or  disease  vectors,  partici- 
pants are  often  motivated  more  by  management  and  control  goals  than  scien- 
tific interest,  making  it  important  to  work  closely  with  environmental  and  pub- 
lic health  agencies  and  related  stakeholders.  At  the  same  time,  the  project  must 
be  built  on  a solid  scientific  foundation  and  this  requires  novel  approaches  to 
data  validation  and  analysis. 

AtrapaelTigre.com  uses  three  independent  methods  for  validating  citizen  sci- 
entist reports,  as  well  as  a background  location  feature  that  makes  it  possible  to 
estimate  sampling  effort  and  correct  bias.  Moreover,  Tigatrapp’s  passive  citizen 
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scientist  data  is  combined  with  active  ovitrap  surveillance,  with  each  data  type 
complementing  the  other  and  allowing  for  cross-calibration. 

Our  2014  and  2015  results  suggest  that  these  methods  are  promising,  but 
still  must  be  improved.  We  hope  to  increase  more  the  quality  and  quantity  of 
citizen  scientists  photographs  and  to  develop  new  quantitative  methods  for 
assessing  the  reliability  of  reports  without  attached  photographs.  We  also  hope 
to  improve  the  multi-proxy  validation  system,  as  each  of  the  three  validation 
methods  has  its  own  set  of  drawbacks:  the  crowd  and  app-user  validation 
methods  are  capable  of  handling  massive  data  but  are  prone  to  error,  while 
expert  validation  appears  more  accurate  but  is  more  costly  and  better  suited  to 
limited  quantities  of  data.  The  goal  is  to  find  ways  for  the  expert  validation  of 
a manageable  part  of  the  data  to  inform  and  improve  the  crowd  and  app-user 
validation  for  the  rest.  Finally,  another  important  step  will  be  to  formally  frame 
all  of  these  methods  in  a semi-automated,  intelligent  alert  system  that  incorpo- 
rates prior  knowledge  and  directs  interesting  findings  back  to  stakeholders  and 
the  general  public. 


Final  note:  from  AtrapaelTigre.com  to  Mosquito  Alert 

On  February  2016,  at  the  time  of  editing  this  manuscript,  the  project  incor- 
porated a new  target  species,  the  yellow  fever  mosquito  (Aedes  aegypti)  and 
changed  name  accordingly:  from  AtrapaelTigre.com , specific  for  the  tiger  mos- 
quito, to  Mosquito  Alert  (available  at  www.mosquitoalert.com).  The  yellow 
fever  mosquito  has  a similar  appearance  and  behavior  to  the  Asian  tiger  mos- 
quito and  is  currently  considered  the  primary  vector  of  yellow  fever,  chikun- 
gunya,  dengue  and  Zika  viruses  (ECDC  2016),  having  raised  a lot  of  interna- 
tional concern  in  the  recent  American  Zika  outbreak  (Kindhauser  et  al.  2016). 
The  species  was  historically  present  in  Spain,  and  could  be  reintroduced  again 
through  the  island  of  Madeira,  where  it  is  known  to  be  the  cause  of  a dengue 
outbreak  in  2012  (ECDC  2016).  In  this  sense,  the  incorporation  of  this  new 
species  for  the  project  in  Spain  is  relevant  in  public  health  terms  and  follows 
primarily  an  early  warning  system  strategy. 
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Abstract 

In  the  past  few  years,  crowdsourced  geographic  information  (also  called  vol- 
unteered geographic  information)  has  emerged  as  a promising  information 
source  for  improving  urban  resilience  by  managing  risks  and  coping  with  the 
consequences  of  disasters  triggered  by  natural  hazards.  This  chapter  presents 
a typology  of  sources  and  usages  of  crowdsourced  geographic  information  for 
disaster  management,  as  well  as  summarises  recent  research  results  and  present 
lessons  learned  for  future  research  and  practice  in  this  field. 
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Introduction 

The  potential  of  Crowdsourced  Geographic  Information  (CGI)  as  a new  infor- 
mation source  for  disaster  risk  management  has  been  paradigmatically  shown 
during  the  earthquake  that  hit  Haiti  in  2010.  Due  to  a lack  of  official  data,  infor- 
mation gathered  from  social  media,  via  SMS  and  from  OpenStreetMap  became 
crucial  for  disaster  response.  In  the  past  few  years,  CGI  found  their  way  into 
different  disaster  situations  and  scenarios  (Horita  et  al.  2013).  The  use  of  geo- 
graphic information  for  disaster  risk  management  has  attracted  great  interest 
both  in  research  and  practice,  mainly  because  of  the  possibility  to  tap  into  the 
collective  intelligence  or  the  ‘wisdom  of  the  crowds  to  improve  urban  resil- 
ience, i.e.  to  improve  the  capacity  of  urban  areas  to  better  managing  disaster 
risks  and  coping  with  the  effects  of  extreme  events. 

In  general,  CGI  in  the  context  of  disaster  risk  management  can  be  catego- 
rised according  to  the  information  source  into  the  following  types: 

1)  Social  media : Information  produced  by  people  about  the  event  in  usual 
social  media  platforms  (e.g.  Twitter,  Flickr,  Instagram,  Facebook),  such  as 
from  eyewitness  that  exchange  and  disseminate  information  about  a dis- 
aster event. 

2)  Crowd  sensing : Information  collected  from  dedicated  applications  and 
platforms  (e.g.  Ushahidi)  that  are  aimed  specifically  at  producing  infor- 
mation for  disaster  risk  management. 

3)  Collaborative  mapping : Information  about  geographic  features  of  disaster- 
affected  or  disaster-prone  areas,  which  is  produced  by  volunteers  using 
mapping  platforms  (e.g.  OpenStreetMap,  Wikimapia),  e.g.  as  derived 
from  satellite  imagery. 

Although  there  is  a growing  body  of  research  related  to  each  of  these  CGI  types 
in  different  phases  and  tasks  of  disaster  management,  existing  research  studies 
usually  focus  on  a particular  type  of  CGI  and  are  not  able  to  relate  to  relevant 
developments  associated  with  other  CGI  types.  The  goal  of  this  chapter  is  to 
present  to  a holistic  view  of  this  field  by  means  of  a typology  that  is  able  to 
distinguish  the  main  features  and  potentials  of  each  CGI  type  for  disaster  risk 
management.  This  typology  is  valuable  not  only  to  summarise  recent  research 
results,  but  also  to  identify  more  integrated  directions  for  future  research  on 
CGI  towards  improving  disaster  management  and  urban  resilience.  The  next 
sections  are  thus  dedicated  to  exploring  these  issues  for  each  of  the  aforemen- 
tioned CGI  types  in  turn,  followed  by  a conclusion. 

Social  Media 

The  first  type  of  geo -information  produced  by  the  ‘crowd’  in  the  context  of 
disasters  is  related  to  the  use  of  existing  social  media  platforms  to  exchange 
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information.  Social  media  has  been  defined  as  a group  of  Internet-based 
applications  that  build  on  the  ideological  and  technological  foundations  of 
Web  2.0,  and  that  allow  the  creation  and  exchange  of  User  Generated  Con- 
tent’ (Kaplan  & Haenlein  2010).  As  such,  these  platforms  allow  users  to  easily 
share  self-produced  content  within  a network  of  contacts  and/or  for  the  general 
public  in  a variety  of  forms:  texts  via  blogs  (from  ‘web  log’)  or  short  messages 
in  ‘microblogging’  (e.g.  Twitter),  web  pages  and  forums,  photos,  videos,  etc. 
Popular  social  media  platforms  include  Twitter,  Facebook,  Flickr,  YouTube, 
Instagram  etc. 

As  people  are  increasingly  familiar  with  and  ordinarily  use  social  media  in 
their  day-to-day  life,  they  naturally  tend  to  uptake  these  platforms  in  the  occur- 
rence of  a disaster  for  communicating  their  experience  and/or  urgent  needs. 
Indeed,  in  different  catastrophic  events  of  the  past  few  years  - from  the  wild- 
fires in  Southern  California,  USA  in  2007,  over  the  Earthquake  in  Haiti  in  2010, 
up  to  the  recent  super  typhoon  in  the  Philippines  2013  - social  media  has  ena- 
bled the  affected  population  to  produce  information  about  extreme  events  and 
their  catastrophic  impacts  (Sakaki,  Okazaki  & Matsuo  2010;  Crooks  et  al.  2013; 
De  Longueville  et  al.  2010). 

In  the  field  of  disaster  risk  management,  a large  part  of  the  existing  research 
focused  on  the  analysis  of  short  messages  of  the  Twitter  platform,  the  so-called 
Tweets  (Steiger  et  al.  2015).  For  instance,  Sakaki  et  al.  (2010)  and  Crooks  et  al. 
(2013)  investigated  the  use  of  Twitter  for  detecting  and  estimating  the  trajec- 
tory of  earthquakes  in  real  time.  De  Longueville  et  al.  (2010)  proposed  the  use 
of  VGI  as  a sensor  for  detecting  forest  fire  hot  spots,  based  on  previous  work 
that  analysed  the  application  of  Twitter  as  a source  of  spatiotemporal  informa- 
tion for  wildfire  events  in  France.  In  contrast,  Fuchs  et  al.  (2013)  showed  that 
event  detection  based  on  peaks  of  Twitter  activity  did  not  work  for  the  2013 
floods  in  Germany  and  presented  an  analysis  of  spatiotemporal  clusters.  Bakil- 
lah  et  al.  (2014)  applied  graph  clustering  to  support  the  detection  of  geolocated 
communities  in  Twitter  after  the  typhoon  Haiyan  in  the  Philippines.  Further- 
more, a number  of  studies  are  concerned  about  developing  tools  for  visualising 
social  media  data  in  order  to  enable  make- sensing  and  location-  based  knowl- 
edge discovery  (MacEachren  et  al.  2011;  Terpstra  & de  Vries  2012;  Croitoru 
et  al.  2013;  Spinsanti  & Ostermann  2013). 

Another  group  of  studies  seek  to  identify  useful  information  from  social 
media  that  could  be  valuable  for  improving  situation  awareness  (Yin  et  al. 
2012).  For  instance,  Vieweg  et  al.  (2010)  and  Starbird  et  al.  (2010)  analysed 
Twitter  messages  during  the  flooding  of  the  Red  River  Valley  in  the  United 
States  and  Canada  in  2009,  seeking  to  discern  activity  patterns  and  extract 
useful  information. 

Most  of  the  existing  work  in  the  area  has  sought  to  make  sense  of  social  media 
data  as  a stand-alone  source  by  analysing  aggregated  patterns,  e.g.  by  defining 
thresholds  for  the  size  of  spatiotemporal  clusters  of  messages  that  would  serve 
as  signals  for  crisis  events  of  earthquakes  (Sakaki  et  al.  2010,  Crooks  et  al.  2013), 
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wildfires  (De  Longueville  et  al.  2010,  Slavkovikj  et  al.  2014)  or  disease  sur- 
veillance (Gomide  et  al.  2011,  Bernardo  et  al.  2013).  However,  with  such  an 
approach  the  actual  content  of  social  media  messages  is  largely  ignored,  and 
with  this,  much  of  their  potential  to  improve  the  current  knowledge  about  the 
unfolding  situation  is  lost.  Furthermore,  although  event  detection  is  useful  for 
sudden-onset  crises  for  which  there  do  not  exist  any  other  related  data,  in  many 
concrete  cases,  there  are  additional  information  sources  available.  As  pointed 
out  by  Lazer  et  al.  (2014),  one  should  not  see  ‘big  data  as  a substitute  for  all 
existing  data,  but  rather  take  the  challenge  of  doing  innovative  analytics  by 
using  data  from  all  traditional  and  new  sources. 

This  is  in  line  with  a nascent  research  stream  that  uses  VGI  in  combination 
with  other  geodata  sources  in  the  field  of  disaster  management  (Albuquerque 
et  al.  2015;  Schnebele,  Cervone  & Waters  2014;  Triglav-Cekada  & Radovan 
2013;  Spinsanti  & Ostermann  2013).  For  instance,  Albuquerque  et  al.  (2015) 
leveraged  authoritative  sensor  data  of  water  gauges  to  show  that  Tweets  close 
to  flooded  areas  are  more  probable  to  contain  useful  information  for  disaster 
management  (see  Figure  1). 

Building  upon  these  initial  results,  an  important  direction  for  future  research 
endeavours  is  the  development  of  improved  analytical  methods  that  are  able 
leverage  several  different  data  sources  in  order  to  provide  event  detection,  visu- 
alisation and  information  extraction  from  crowdsourced  geo -information  of 
social  media  that  are  better  matched  to  the  needs  of  decision  makers  in  the  field 
of  disaster  management. 


Crowd  Sensing 

A second  type  of  activity  related  to  the  use  of  new  collaborative  technologies 
for  disasters  is  the  emergence  of  the  so-called  ‘crowd  sensing  (Ma  et  al.  2014). 
This  activity  involves  citizens  on  the  Web  that  can  act  as  sensors  and  share  their 
observations.  Differently  from  crowdsourced  information  derived  from  social 
media  covered  in  the  previous  section,  here  the  term  crowd  sensing  is  used  to 
describe  approaches  that  rely  upon  dedicated  software  platforms  for  gathering 
specific  and  structured  data,  as  well  as  for  exploiting  the  interpretive  and  ana- 
lytic skills  and  local  knowledge  of  citizens. 

These  approaches  are  also  related  to  the  concept  of  citizen  science,  which  is 
described  by  Haklay  (2013)  as  ‘scientific  activities  in  which  non-professional 
scientists  voluntarily  participate  in  data  collection,  analysis  and  dissemination 
of  a scientific  project’.  As  such,  people  using  platforms  for  ‘citizens  as  sensors’  or 
‘citizen  scientists’  get  engaged  for  accomplishing  a set  of  tasks  in  a coordinated 
and  purposeful  manner.  These  tasks  mostly  involve  some  kind  of  data  collec- 
tion for  different  types  of  scientific  investigations,  the  most  famous  examples 
being  bird  watching  and  other  types  of  environmental  observations. 
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Figure  1:  Distribution  of  Twitter  messages  that  were  sent  during  the  2013  Elbe 
Floods  in  Germany  (top)  in  contrast  with  flooded  catchments  as  indicated  by 
river  gauges  (bottom)  (adapted  from  de  Albuquerque  et  al.  2015). 


In  the  context  of  disasters,  several  crowd  sensing  platforms  were  created 
including  dedicated  mobile  applications  for  disaster  management  and  earth 
observation  (Ferster  & Coops  2013).  Using  volunteers  to  perform  a specific 
task,  such  as  environmental  monitoring,  collectively  make  a Citizen  Observa- 
tory (CO),  where  data  can  be  collected,  collated  and  published  (Degrossi,  et  al. 
2014;  Liu  et  al.  2015).  Thus,  the  term  Citizen  Observatory  can  be  understood  as 
a software  platform  used  by  citizens  to  produce  volunteered  information  about 
a specific  topic  through  different  devices  (e.g.  web,  mobile  app  and  SMS),  and 
allow  their  visualisation. 

An  important  software  platform  for  implementing  Citizen  Observatories  is 
called  Ushahidi1  (which  means  ‘testimony’  in  Swahili).  This  platform  was  first 
developed  in  the  context  of  election  monitoring  in  Kenya  and  later  developed 


Available  at:  http://www.ushahidi.com  [Accessed  February  12th  2014]. 
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Figure  2:  Flood  Citizen  Observatory  prototype  (adapted  from  Degrossi  et  al. 
2014). 


as  an  open-source  toolbox  that  can  be  deployed  in  several  situations  to  collect 
data  from  on-the-ground  volunteers  via  web  site  and  mobile  application,  but 
also  for  remote  volunteers  to  collaborative  categorise  information  collated  from 
many  sources,  including  social  media  (discussed  in  the  previous  section).  One 
application  example  that  is  built  upon  the  Ushahidi  platform  is  the  prototype 
Flood  Citizen  Observatory  implemented  in  Brazil  (Figure  2)  for  allowing  citi- 
zens to  report  about  the  local  conditions  of  river  levels,  flooded  areas,  as  well  as 
consequences  of  flooding  (Degrossi  et  al.  2014;  Horita  et  al.  2015). 

While  crowd  sensing  and  citizen  observatories  can  be  potentially  used  to 
provide  useful  information  about  the  impacts  caused  by  extreme  events  and 
their  victims,  one  important  issue  is  to  be  addressed  is  how  to  motivate  peo- 
ple to  contribute  with  valuable  information.  Another  important  point  to  be 
addressed  is  how  to  validate  and  integrate  information  from  volunteers  with 
other  sources  of  data  for  effectively  improving  decision-making  related  to  dis- 
aster risk  management. 


Collaborative  Mapping 

The  third  type  of  crowdsourced  geo -information  comprises  a specific  type  of 
information  and  collaboration  platform:  the  collaborative  edition  of  geographic 
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features  to  fulfil  internet-based  interactive  maps.  The  well-known  platforms 
Wikimapia2  and  OpenStreetMap  (OSM)3  fall  into  this  category,  as  well  as  the 
crowdsourcing  component  of  the  popular  GoogleMaps  platform,  so-called 
GoogleMapMaker4  (which,  unlike  the  previous  ones,  does  not  have  an  open 
data  policy  and  thus  does  not  provide  users  with  full  access  to  the  collected 
data). 

A distinctive  feature  of  this  type  of  activity  is  the  collaborative  collection  of 
a very  specific  type  of  data  - namely,  georeferenced  data  about  features  like 
streets  and  roads,  buildings  etc.  - and  the  structuring  of  this  information  in 
form  of  a map.  In  doing  so,  the  volunteer  community  seeks  to  produce  a map 
that  is  as  complete  and  detailed  as  possible,  leveraging  the  local  knowledge  of 
a wide  base  of  users  to  collaboratively  fill  the  gaps.  Recent  research  works  have 
shown  that,  at  least  for  the  regions  with  the  most  active  communities  in  OSM, 
the  results  achieved  a quality  level  that  is  comparable  to  official  and  commercial 
maps  (Neis,  Zielstra  & Zipf  2011;  Haklay  2010). 

The  maps  produced  by  volunteers  in  this  way  are  clearly  of  great  relevance 
in  the  context  of  disasters.  High  quality  and  precise  maps  are  an  important 
resource  for  a number  of  tasks  in  disaster  management,  being  used  from  emer- 
gency planning  up  to  the  coordination  of  relief  efforts.  In  several  disaster  events 
of  the  past  few  years,  the  volunteer  community  has  been  very  actively  engaged 
in  producing  collaborative  maps  to  assist  disaster  management,  especially  the 
community  of  OpenStreetMap.  By  digitising  the  infrastructure,  and  especially 
important,  also  the  level  of  damage  (where  it  can  be  detected),  they  create  a sit- 
uation map  that  can  be  used  by  the  emergency  responders  directly  in  the  field. 

After  the  devastating  earthquake  that  struck  Haiti  in  January  2010,  for 
instance,  there  was  a very  significant  response  of  the  international  OSM  com- 
munity (Neis  et  al.  2010;  Zook  et  al.  2010).  In  the  aftermath  of  that  severe 
quake,  good-quality  maps  were  not  available  to  guide  the  relief  efforts,  and  the 
standard  map  services  in  the  web  (e.g.  GoogleMaps)  lacked  adequate  coverage 
and  had  to  be  updated  to  reflect  the  current  status  of  the  many  blocked  roads 
and  streets.  A few  hours  after  the  quake,  the  volunteers  of  the  OpenStreetMap 
(OSM)  community  around  the  world  started  mapping  remotely  the  affected 
regions  based  on  satellite  imagery,  seeking  to  trace  the  outlines  of  streets,  build- 
ings and  places  of  interest.  Later,  high-resolution  and  very  up-to-date  post-dis- 
aster satellite  images  were  made  freely  available  and  the  OSM  community  could 
then  resort  to  those  in  order  to  also  record  the  damage  of  buildings  and  block- 
ages in  streets  and  roads.  Such  imagery  is  a very  crucial  source  of  informa- 
tion for  the  mappers  to  be  used,  because  it  allows  also  volunteers  to  contribute 
from  all  over  the  world,  not  only  people  that  are  directly  at  the  affected  areas. 
For  large-scale  disaster  events  with  many  international  contributors  this  can 


2 Available  at:  http://wikimapia.org  [Accessed  February  12th  2014]. 

3 Available  at:  http://openstreetmap.org  [Accessed  February  12th  2014]. 

4 Available  at:  https://www.google.com/mapmaker  [Accessed  February  12th  2014]. 
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Figure  3:  Elements  of  the  critical  infrastructure  from  OpenStreetMap  in  the 
Phillipines,  which  was  affected  by  Typhoon  Haiyan  2013  (adapted  from 
Reimer  et  al.  2014). 


generate  highly- detailed  maps  extremely  quickly  The  information  produced 
was  then  available  for  guiding  relief  efforts,  not  only  allowing  better  visual 
orientation  through  the  interactive  maps,  but  also  for  importing  the  data  into 
GPS  devices  for  local  orientation,  as  well  as  using  the  database  behind  OSM 
for  providing  more  sophisticated  services.  For  instance,  an  emergency  rout- 
ing service  was  developed  to  allow  quick  identification  of  the  best  routes  for 
relief  efforts  based  on  the  up-to-date  situation  mapped  by  the  OSM  community 
(Neis,  Singler  & Zipf  2010). 

Ever  since  2010,  numerous  OSM  contributors  provided  their  support  in 
mapping  events  in  the  aftermath  of  a disaster,  producing  the  so-called  Crisis 
Maps.  As  a result,  within  the  OSM  community  a initiative  called  Humanitar- 
ian OpenStreetMap  Team  (H.O.T.)5  was  launched  to  organise  the  many  crisis 
mapping  actions  of  the  OSM  community  and  is  also  in  contact  with  other  rel- 
evant humanitarian  organisations.  This  engagement  attracted  serious  interest 
in  academic  circles  as  well  as  on  the  side  of  humanitarian -aid  organisations 
(Harvard  Humanitarian  Initiative  2011;  United  Nations  Office  for  the  Coor- 
dination of  Humanitarian  Affairs  2013).  One  significant  example  of  the  use  of 
such  information  could  be  attested  in  a more  recent  major  catastrophic  event: 
the  typhoon  Haiyan  that  hit  the  Philippines  in  2013  (Reimer  et  al.  2014).  In  this 
case,  the  international  OSM  community  was  also  very  active  and  collaborated 
in  a coordinated  way  with  humanitarian  organisations  such  as  the  American 


5 Available  at:  http://hot.openstreetmap.org/  [Accessed  February  12th  2014]. 
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Red  Cross  and  UN-OCHA  (Office  for  Coordination  Affairs  of  the  United 
Nations),  for  example,  for  extracting  information  about  elements  at  risk  of  the 
so-called  Critical  Infrastructure  (see  Figure  3),  i.e.  critical  elements  that  must 
particular  attention  in  a disaster  management  such  as  schools,  hospitals,  fuel 
stations  etc.  (Reimer  et  al.  2014;  Schelhorn  et  al.  2014;  Herfort  et  al.  2015). 

However,  while  one  main  advantage  of  OSM  is  that  their  contributors  mainly 
focus  on  their  well-known  local  surroundings  (Goodchild  2007;  Neis  & Zipf 
2012),  Crisis  Maps  originate  largely  from  mappers  who  work  remotely.  There- 
fore, due  to  the  fact  that  OSM  is  a crowdsourced  map  and  that  a main  part 
of  the  data  in  Crisis  Maps  originates  exclusively  from  contributors  that  work 
remotely,  humanitarian -aid  agencies  and  first  responders  have  doubts  about 
the  quality  of  the  OSM  data  and  therefore  sometimes  refrain  from  utilising  it 
(Harvard  Humanitarian  Initiative  2011).  Furthermore,  activity  areas  of  remote 
mappers  generally  lack  official  cross-reference  data,  making  it  difficult  to  apply 
usual  quality  assessment  methods,  which  are  based  on  comparisons  with  ref- 
erence data.  In  this  manner,  an  important  direction  for  future  research  is  to 
develop  methods  for  assessing  and  improving  the  quality  of  the  geo -infor- 
mation produced  by  remote  volunteers,  especially  considering  the  particular 
needs  and  requirements  of  the  field  of  disaster  risk  management. 


Conclusion 

Crowdsourced  geographic  information  (CGI)  holds  a big  potential  not  only  for 
coping  with  the  effects  of  disaster  events,  but  also  for  implementing  preventive 
measures  for  improving  the  resilience  of  urban  areas  against  natural  hazards 
and  extreme  events.  We  presented  and  discussed  three  main  types  of  crowd- 
sourced geo -information  that  can  be  explored  for  this  purpose:  social  media, 
crowd  sensing  and  collaborative  maps. 

CGI  of  these  different  types  can  be  incorporated  into  disaster  risk  manage- 
ment in  many  different  ways.  As  shown  in  the  previous  sections,  the  most 
important  usage  of  CGI  in  this  context  is  improve  situation  awareness  in  the 
monitoring  of  unfolding  events,  i.e.  to  complement  conventional  informa- 
tion sources  with  first-hand  geographic  information  from  the  crowd  shared 
in  social  media,  citizen  sensing  platforms  and/or  collaborative  maps.  In  this 
manner,  it  is  possible  to  get  more  fine-grained  and  up-to-date  spatial  informa- 
tion about  what  is  happening  on  the  ground.  Clearly,  this  information  is  of 
great  value  for  creating  maps  to  support  emergency  agencies  in  disaster  relief, 
both  in  field  missions  and  in  emergency  operation  centres.  Although  CGI  is 
becoming  more  and  more  used  for  this  purpose,  significant  challenges  remain 
in  filtering  and  prioritising  useful  and  valuable  information  amidst  the  large 
stream  of  non-relevant  data.  Since  most  existing  studies  are  still  focused  on  a 
single  source  of  CGI,  the  integration  and  fusion  of  the  different  types  of  CGI 
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with  other  authoritative  data  sources  and  processes  of  emergency  agencies  is  a 
still  underexplored  topic  that  should  be  addressed  in  future  research  efforts  in 
this  area. 

Furthermore,  the  use  of  CGI  in  mitigation  and  preparation  phases  should  be 
emphasised  in  future  studies.  This  could  be  done  for  instance  by  leveraging  ini- 
tial examples  of  using  CGI  from  collaborative  maps  to  support  activities  in  dis- 
aster risk  management,  such  as  in  the  identification  of  critical  infrastructures 
to  support  emergency  planning  (Herfort  et  al.  2015;  Schelhorn  et  al.  2014),  for 
instance  for  performing  evacuation  simulations  (Bakillah  et  al.  2012,  Goetz  & 
Zipf  2012)  and  estimating  the  vulnerability  of  urban  areas  based  on  synthetic 
information  about  the  potentially  affected  population  (Bakillah  et  al.  2014). 
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Abstract 

Routing  and  navigation  web  services  are  becoming  widely  used,  and  make  use 
of  both  commercial  and  VGI  datasets.  It  is  now  becoming  widely  acknowledged 
that  a one  fits  all5  method  of  generating  and  presenting  routes  is  not  applicable. 
In  particular,  the  accessibility  of  places  for  the  mobility  impaired  has  become  a 
key  focus  with  several  services  addressing  topics  such  as  how  accessible  loca- 
tions of  interest  are  and  how  to  best  generate  routes  for  people  who  need  to  con- 
sider additional  factors.  Though  datasources  such  as  OpenStreetMap  (OSM) 
are  well  suited  for  such  topics,  several  issues  including  the  quality  of  the  under- 
lying data  remain.  Through  the  use  of  quality  assessment  tools  it  is  possible 
to  identify  areas  with  inadequate  data  completeness  with  regards  to  the  infor- 
mation needed  for  the  mobility  impaired  and  thus  encourage  the  enrichment 
of  these  areas  through  specialised  tagging  applications.  Such  data  can  then  be 
used  in  routing  and  navigation  services  which  focus  on  ensuring  that  routes 
being  generated  and  presented  fit  the  personal  requirements  of  the  traveller. 
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Introduction 

Routing  and  navigation  services  for  vehicles  and  people  are  based  on  geospa- 
tial data.  Due  to  the  availability  of  GPS  sensors  within  handheld  devices,  sys- 
tems have  been  able  to  localise  themselves  precisely  on  the  road.  These  services 
increasingly  make  use  of  public  transport  data  and  integrate  near  real-time 
traffic  data.  Also,  through  the  development  of  new  algorithms  that  employ 
hierarchical  methods,  routing  has  become  faster,  especially  for  calculating  long 
distance  routes.  Nowadays,  routing  services  are  able  to  provide  routes  based  on 
several  criteria  such  as  distance,  road  type,  and  traffic  to  an  acceptable  degree 
of  accuracy.  In  addition,  further  research  prototypes  have  identified  a num- 
ber of  extra  criteria.  With  the  arrival  of  crowdsourcing  and  VGI  (in  particu- 
lar the  OpenStreetMap  project),  a new  generation  of  route  planning  services 
using  such  services  has  emerged.  As  an  example,  OpenRouteService.org  (Neis 
& Zipf,  2008)  used  OSM  as  data  source  for  deriving  optimal  routes  between 
two  locations  for  different  modes  of  transport  including  car,  several  types  of 
bikes,  and  pedestrians.  The  potential  of  crowdsourced  geographic  information 
for  routing  and  navigation  can  be  highlighted  by  the  fact  that  the  OSM  based 
OpenRouteService1  was  able  to  provide  pedestrian  and  bicycle  routing  across 
several  countries  even  before  Google  offered  these  features.  This  is  because 
of  the  different  way  crowdsourced  information  like  OSM  is  being  collected. 
Volunteers  do  this  on  the  ground  (commonly  on  foot  or  bicycle)  leading  to  a 
higher  representation  of  this  particular  kind  of  data  than  offered  by  commer- 
cial providers  before.  This  particular  richness  of  VGI  also  offers  new  possibili- 
ties for  even  more  specific  information  needs  and  specialised  applications. 

Besides  the  support  of  these  mainstream  route  planning  needs,  routing  ser- 
vices that  serve  people  with  special  requirements  (such  as  wheelchair  routing) 
are  being  designed  and  developed.  However,  in  order  to  provide  the  appropri- 
ate data  required  for  such  services  that  meet  the  specific  information  require- 
ments of  the  respective  users,  new  sources  of  information  are  required.  For 
instance,  in  addition  to  road  features,  a wheelchair  routing  service  would  need 
to  consider  sidewalk  data,  such  as  curbs,  surface  type  and  incline  information, 
to  name  a few.  Muller  et  al.  (2010)  therefore  suggested  several  extensions  to  the 
OSM  tagging  schema  including  new  tags,  as  well  as  identifying  a selection  of 
relevant  already  existing  ones.  This  supported  OSM  mappers  in  the  collection 
of  such  information.  These  tags  have  been  added  to  the  OSM  schema  and  are 
described  in  the  OSM  wiki2.  Such  detailed  information  is  however,  still  not 
generally  available  or  complete  in  the  OpenStreetMap  dataset.  Therefore,  spe- 
cial crowdsourcing  tools  and  services  are  necessary  in  order  to  better  support 
the  collection  of  geospatial  data  relevant  for  routing  and  navigation  of  people 
with  limited  mobility. 


1 http://www.openrouteservice.org 

2 http://wiki.openstreetmap.org/wiki/Wheelchair_routing 
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To  address  the  incompleteness  of  data  it  is  necessary  to  develop  and  pilot- 
test  methods  and  tools  for  collectively  gathering  and  sharing  spatial  informa- 
tion for  improving  accessibility  The  power  of  online  maps  and  mobile  devices 
can  help  foster  an  awareness  of  barriers  for  individuals  with  limited  mobility 
and  encourage  the  removal  of  such  barriers.  As  this  is  highly  relevant  for  an 
increasing  proportion  of  our  society,  the  European  Commission  has  decided 
to  fund  projects  in  this  domain.  One  example  is  the  CAP4 Access3  project. 
The  agenda  of  research  and  development  in  this  field  includes  the  design  and 
implementation  of  tools  and  methods  for  (a)  quality  assessment,  i.e.  checking 
the  completeness  of  OSM  data  with  regard  to  required  information;  b)  tagging, 
i.e.  describing  and  discussing  locations  and  routes  within  the  built  environ- 
ment according  to  their  accessibility;  (c)  route  planning  and  navigation;  (d) 
raising  awareness  and  preparing  effective  measures  at  local  level  for  eliminat- 
ing barriers.  Target  groups  include  people  requiring  enhanced  accessibility, 
grassroots  initiatives  supporting  people  with  disabilities,  policy-makers,  plan- 
ners and  service  providers  with  responsibility  for  the  built  environment,  and 
the  general  public. 


Required  and  existing  services 

Quality  assessment  and  enrichment  of  crowd-sourced  data 

Ensuring  the  quality  of  crowdsourced  data  is  of  particular  importance  in  order 
to  ensure  that  the  results  of  routing  and  navigation  services  offered  are  accu- 
rate and  ultimately  useful.  Generally,  routing  and  navigation  services  for  people 
with  limited  mobility  benefit  from  information  regarding  new  obstacles  pro- 
vided by  involved  communities.  However,  crowdsourced  data  has  significant 
differences  from  traditional  geospatial  data  which  are  often  created  by  spe- 
cifically dedicated  organizations  and  experts,  and  are  generated  according  to 
standardised  structures  and  languages.  Especially  in  the  case  of  routing  tools, 
completeness  of  data  and  spatial  accuracy  is  of  great  concern  to  ensure  proper 
routing.  The  development  of  a data  quality  assessment  component  is  therefore 
a crucial  objective. 

There  are  a number  of  different  geo -data  quality  elements  that  one  might 
need  to  check  before  using  any  kind  of  datasets  in  their  project.  These  ele- 
ments include  positional  accuracy,  attribute  accuracy,  completeness,  logical 
consistency  and  temporal  accuracy  (van  Oort,  2006).  A discussion  on  all  data 
quality  elements  are  out  of  the  scope  of  this  chapter.  However,  as  an  example 
we  provide  information  on  one  of  the  important  data  quality  elements  for 
routing  and  navigation  services  - completeness.  Completeness  is  defined  as 
errors  of  omission  (measure  of  the  absence  of  data),  and  errors  of  commission 


3 http://myaccessible.eu/ 
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(measure  of  the  presence  of  extra  data)  (van  Oort,  2006).  The  completeness  of 
a dataset  can  be  suitable  for  a specific  task  but  not  for  another.  So,  when  com- 
pleteness has  to  be  measured,  the  concept  of  fitness  for  use  comes  in  mind. 

In  order  to  check  the  completeness  of  a dataset  three  tests  could  be  performed: 

• missing  object/line:  here  we  find  missing  information  at  object  level,  whether 
a line  or  polygon  is  missing  in  the  dataset.  This  should  be  done  by  using  a 
reference  dataset  if  possible. 

• missing  attribute : for  those  objects  (point/line/polygon)  that  are  available 
we  need  to  know  which  attributes  are  missing  (based  on  a list  of  attributes 
that  are  important  and  used  by  the  routing  and  navigation  system).  Missing 
attributes  are  counted  as  inconsistencies  and  reported.  This  check  can  be 
performed  by  automated  means  of  intrinsic  data  check,  which  means  that 
only  the  dataset  itself  is  investigated  (Barron  et  al.  2013). 

• missing  value : From  those  existing  attributes,  some  may  be  incomplete  in 
terms  of  missing  values.  Here  we  check  those  attributes/ fields  that  lack  value 

There  are  several  tools  that  could  be  used  for  checking  the  completeness  of  an 
OpenStreetMap  dataset.  For  example,  OSMatrix4  is  a tool  for  visualising  map- 
ping progress/quality  on  various  metrics  (Roick  et  al.  2011,  2012).  By  using 
OSMatrix,  one  can  calculate  the  number  of  certain  object  features  in  OSM  (e.g. 
sidewalks)  at  various  timestamps.  For  example,  Figure  1 shows  a snapshot  of 
visual  and  statistical  information  regarding  the  total  number  of  sidewalk  infor- 
mation (tags  in  OpenStreetMap  related  to  sidewalk  information)  for  a selected 
region  in  Heidelberg,  in  Jan  2016.  The  sample  area  is  divided  into  hexagonal 
cells  with  the  size  of  1 km.  Each  cell  shows  a value  representing  the  aggrega- 
tion of  the  total  number  of  tags  in  OSM  related  to  sidewalk  attributes  in  that 
hexagon.  OSMatrix  could  also  be  used  in  order  to  derive  statistical  and  visual 
information  regarding  the  completeness  of  OSM  data  at  the  attribute  level  (e.g. 
sidewalk  width,  incline,  etc.). 

As  another  example  of  tools,  OSM  Quality  Assurance  Editor5  can  be  used 
to  understand  the  completeness  of  certain  object  features  in  OSM  data  using 
an  object-based  approach.  Figure  2 shows  a region  in  Heidelberg  (Germany) 
where  sidewalk  information  is  missing  (the  road  features  contain  no  informa- 
tion regarding  the  presence  or  absence  of  sidewalks  attached  to  the  road).  This 
information  is  given  per  object,  meaning  that  one  could  select  an  object  and 
view  its  properties  as  opposed  to  the  provision  of  an  aggregated  region-wide 
value.  This  is  a large  benefit  in  comparison  to  OSMatrix  is  in  that  tool  the  sta- 
tistical information  is  aggregated  and  provided  for  each  cell,  while  information 
regarding  route  objects  inside  the  cells  cannot  be  realised. 


4 http://alborz.geog.uni-heidelberg.de/osmatrix/ 

5 http://editor.osmsurround.org/ 
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Figure  1:  Mapping  the  completeness  statistics  of  total  number  of  sidewalk  information  using  OSMatrix. 
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Figure  2:  An  example  of  selected  road  features  that  have  no  information  regard- 
ing existence  of  sidewalk. 


In  order  to  provide  an  effective  routing  and  navigation  service  for  people 
with  limited  mobility,  it  is  crucial  to  have  data  regarding  sidewalk  features  and 
their  attributes  such  as  surface  texture,  width,  incline,  etc.  The  quality  assess- 
ment tools  can  aid  in  the  understanding  of  the  level  of  incompleteness  of  such 
information  within  OpenStreetMap  data,  as  well  as  identifying  the  places 
where  such  information  is  missing.  In  order  to  better  inform  the  user  about 
the  potentially  non-perfect  data  for  wheelchair  routing,  Neis  (2014)  developed 
an  initial  prototype  of  a reliability  index  that  attempts  to  measure  some  aspects 
regarding  the  information  quality  of  the  selected  route. 

With  regards  to  data  availability,  Figure  2 shows  that  most  road  and  streets 
in  the  selected  region  in  Heidelberg  are  missing  sidewalk  information.  There- 
fore, in  order  to  collect  and  enrich  OSM  with  sidewalk  information  a tagging 
system  is  suggested.  Tagging  systems  are  developed  to  support  the  collection 
of  user-generated  data.  In  the  case  of  the  restricted  mobility  topic,  this  data 
focusses  on  the  accessibility  of  places,  points  of  interest  and  roads.  The  collec- 
tive tagging  approach  has  already  been  applied  within  the  accessibility  theme 
through  the  Wheelmap6  platform.  Wheelmap  is  a map  for  wheelchair-accessi- 
ble places.  Locations  are  rated  and  portrayed  according  to  a traffic  light  system 
based  on  their  accessibility  status  (e.g.  accessibility  of  restaurants  for  people  on 
wheelchairs).  One  of  the  disadvantages  of  Wheelmap  with  regards  to  collecting 
accessibility  information  for  improving  routing  and  navigation  of  people  with 


6 http://wheelmap.org/ 
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restricted  mobility  is  that  it  is  not  capable  of  collecting  accessibility  information 
for  linear  objects  such  as  sidewalks. 

Since  sidewalk  information  is  crucial  to  routing  services  that  need  to  provide 
the  most  accessible  route  from  one  location  to  another,  other  tagging  systems 
should  be  used  that  provide  the  capability  of  enriching  sidewalk  information 
such  as  (for  example)  availability  of  a sidewalk,  sidewalk  width,  incline  and 
surface  texture.  For  this  purpose,  any  OSM  editor  that  could  be  used  for  editing 
line  features  (e.g.  Vespucci7  OSM  editor,  JOSM,  etc.)  can  be  employed. 


Routing 

A routing  service  tailored  to  special  needs  is  a core  service  to  improve  the 
mobility  of  people  with  various  types  of  disabilities  (Neis  and  Zielstra  2014). 
A routing  system  mostly  consists  of  two  core  components:  a graph  network 
representing  the  underlying  street  datasets,  and  a routing  engine  that  uses  this 
graph  to  generate  feasible  routes. 

There  are  a number  of  available  routing  services  described  on  the  OSM  Wiki 
pages8  that  make  use  of  OSM  data.  However,  of  the  13  route  services  docu- 
mented there  only  three  currently  provide  the  functionality  to  be  extended  for 
to  address  specialised  requirements,  such  as  wheelchair  routing.  These  three 
are  OpenRouteService,  Routino9  and  OpenTripPlanner10.  OpenRouteService. 
org  (ORS;  Neis  & Zipf  2008)  is  built  according  to  open  standards  from  the 
Open  Geospatial  Consortium  (OGC)  meaning  that  it  can  easily  be  integrated 
in  other  applications  or  regional  web  sites.  In  order  to  use  the  crowdsourced 
data  from  OSM  for  wheelchair  routing,  applications  such  as  ORS  have  been 
(and  will  need  to  further  be)  extended  so  that  they  can  be  used  by  persons 
with  limited  mobility  with  various  profiles  and  parameters.  Figure  3 shows  the 
result  of  a route  plan  where  a difference  between  the  pedestrian  profile  and  the 
wheelchair  profile  occurs  due  to  a pedestrian  bridge  that  is  only  accessible  via 
steps  which  are  not  feasible  for  wheelchair  users.  A first  prototype  for  wheel- 
chair routing  based  on  OpenRouteService  was  developed  earlier  by  Muller 
et  al.  (2010)* 11. 

The  main  challenge  to  serve  special  route  planning  requirements  is  the  need 
for  very  specific  data  (e.g.  sidewalk  data),  which  is  not  available  for  large  areas, 
e.g.  country-wide.  This  especially  includes  data  about  sidewalks  (e.g.  width, 
surface,  smoothness,  incline  and  existence  of  sidewalks)  and  also  the  posi- 
tion and  height  of  sloped/dropped  curbs.  Such  data  is  not  usually  published  in 
authoritative  products  due  to  their  scope,  even  though  the  data  is  often  available 


7 http://wiki.openstreetmap.org/wiki/Vespucci 

8 http://wiki.openstreetmap.org/wiki/Routing/online_routers 

9 http://www.routino.org/ 

10  http : / / www.  op  entripplanner.  org/ 

11  http://rollstuhlrouting.de 
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Figure  3:  Showing  a routing  situation  where  the  wheelchair  profile  leads  to  a different  route. 
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within  the  organisation.  Even  if  these  datasets  were  made  available  however, 
the  heterogeneity  between  data  structures  would  make  integration  problem- 
atic. Considering  this  situation,  VGI  becomes  an  important  data  source  as  local 
knowledge  from  the  users  can  be  harnessed.  Care  needs  to  be  taken  however  to 
provide  feedback  (in  the  form  of  changes  to  existing  services)  to  the  volunteers 
to  ensure  that  their  motivation  is  maintained.  In  terms  of  the  sidewalks  them- 
selves, the  OSM  community  has  decided  that  within  their  dataset,  apart  from 
defined  exceptions,  sidewalks  should  not  be  represented  by  separate  features, 
but  instead  through  attributes  held  by  their  associated  roads. 

Another  challenge  relating  to  routing  (in  particular  for  pedestrians)  is  that 
existing  routing  algorithms  can  only  process  data  that  is  provided  in  a line  for- 
mat. Within  OSM,  many  traversable  surfaces  (such  as  city  squares  and  parks) 
are  represented  as  polygon  areas  which  most  routing  algorithms  simply  do  not 
know  how  to  handle.  Several  methods  could  be  implemented  to  generate  linear 
representations  of  these  spaces  such  as  using  the  polygon  outline  (figure  4a), 
or  generating  internal  linear  structures  through  grids  (figure  4c),  skeletons 
(figure  4d)  or  lines-of-sight. 


Navigation 

Although  routing  systems  provide  a valuable  service  with  regards  to  getting 
around  unknown  environments,  the  determination  of  a route  is  only  one  step  in 
an  overall  wayfinding  process.  As  well  as  being  able  to  calculate  a route  through 
space  that  is  suitable  for  an  individuals  preferences,  this  information  needs  to 
be  conveyed  in  a format  that  allows  them  to  successfully  traverse  the  intended 
route.  Many  navigation  systems  exist  which  attempt  to  provide  direction  instruc- 
tions to  travellers,  often  in  the  form  of  distance  and  street  names.  It  is  well  docu- 
mented that  when  describing  directions  (particularly  for  pedestrians)  landmarks 
form  an  essential  component  of  the  descriptive  (Duckham  et  al.  2010,  Winter  et 
al.  2005).  Proper  selection  of  landmarks  in  navigation  of  people  with  restricted 
mobility  is  challenging  since  it  requires  special  considerations  compared  to  nor- 
mal pedestrians.  For  instance,  the  fact  that  people  on  wheelchair  have  less  vis- 
ibility than  those  who  are  standing  is  an  issue  that  must  be  considered. 

A number  of  methods  have  been  identified  that  extract  landmarks  that  can 
be  used  for  navigational  instructions,  some  of  which  make  use  of  VGI  datasets 
(such  as  Drager  and  Kroller  2012).  Key  considerations  when  producing  such 
methods  include  the  scalability,  performance  in  terms  of  determination  speed, 
and  the  suitability  of  the  actual  landmarks  identified. 

With  regards  to  navigation  services  for  mobility  impaired  users,  it  is  impor- 
tant to  take  into  account  the  individual  requirements  of  each  user  and  then  feed 
these  through  to  the  underlying  route  generation  system.  The  system  should 
also  allow  users  to  report  any  obstacles  that  they  encounter  which  were  not 
included  in  the  route  generation  due  to  the  incompleteness  of  the  underlying 
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a)  Planned  route  using  the  outline  of 
an  open  space  area 


b)  Desired  route  going  through  the 
open  space  area 


c)  Grid  (orange)  based  approach  as 
a subgraph  of  the  polygon  (light 
blue) 


d)  Skeleton  (orange)  based  approach 
as  a subgraph  of  the  polygon  (light 
blue) 


Figure  4:  Different  routing  approaches  through  polygons  (grid,  skeleton,  line 
of  sight). 


dataset.  This  in  turn  not  only  empowers  the  user  in  terms  of  them  making  a 
difference  to  the  system,  but  it  also  increases  the  overall  completeness  of  the 
source  dataset. 

One  of  the  main  challenges  posed  currently  is  how  VGI  and  social  media 
can  be  used  to  identify  suitable  landmarks  that  can  be  used  in  navigational 
instructions.  Not  only  does  this  require  the  fusion  of  freely  available  data  from 
various  sources  (i.e.  OSM,  Foursquare  and  Twitter),  it  also  generates  technical 
problems  such  as  providing  a fast  lookup  service  for  possible  candidates  from 
a population.  Overall,  what  we  want  is  a system  that  can  quickly  provide  an 
instruction  for  any  route  similar  to  “just  after  xxx,  turn  left”  where  m is  the 
most  salient  object  in  relation  to  the  turning  point. 
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Discussion  and  Conclusion 

Routing  and  navigation  services  are  becoming  more  personalized  and  near 
real-time,  by  recommending  routes  that  take  into  account  various  preferences 
and  up-to-date  information.  This  has  long  been  argued  for  in  the  field  of  adap- 
tive and  personalized  Location  Based  Services  (LBS)  (Malaka  and  Zipf  2000, 
Zipf  2002).  In  contrast  to  former  times,  such  developments  can  now  be  realised 
much  easier  due  to  the  wealth  of  rich  data  that  is  made  available  through  VGI 
and  the  Social  Web. 

In  this  chapter,  the  potentials  of  using  crowdsourced  geo -information  for 
routing  and  navigation  with  a focus  on  people  with  restricted  mobility  were 
presented.  Although  the  completeness  of  OpenStreetMap  data  with  regard  to 
specific  information  for  those  user  groups  such  as  relevant  barriers  or  details  on 
sidewalks  still  is  low,  there  are  developments  under  way  to  improve  the  situa- 
tion both  in  an  automated  way  as  well  as  by  harnessing  the  power  of  the  crowd, 
so  we  can  expect  more  and  more  areas  where  enough  information  is  available. 
We  explored  the  OSMatrix  Web-Service  and  presented  how  such  a portal  visu- 
alizing intrinsic  quality  indicators  can  be  used  in  order  to  help  understand  the 
data  and  attribute  availability  for  certain  features  (e.g.  sidewalks).  Furthermore, 
we  see  the  need  for  further  research  in  the  area  of  (semi-)  automated  data  inte- 
gration from  different  user  generated  data  sources  in  order  to  further  enrich 
the  routing  graph  with  relevant  objects  and  attributes. 

As  an  example  related  to  both  pedestrian  and  wheelchair  navigation,  we  dis- 
cussed the  problems  on  deriving  routing  graphs  dealing  with  with  open  spaces  (e.g. 
city  squares  and  parks).  For  this  possible  ideas  for  generating  routing  graph  were 
explained.  These  are  further  investigated  and  evaluated  with  respect  to  efficiency 
and  quality  in  our  current  research.  It  was  discussed  that  while  routing  services 
provide  valuable  information,  they  are  only  the  first  step  in  the  overall  wayfinding 
process.  A navigation  service  is  an  additional  component  that  receives  routing 
information  as  input,  processes,  translates  and  communicates  the  information  in 
a format  that  allows  the  individuals  to  traverse  the  relevant  route.  For  this,  the 
derivation  and  selection  of  landmarks  from  OSM  and  other  crowdsourcing  plat- 
forms is  an  interesting  research  topic.  Regarding  navigation  services  for  mobility 
impaired  users,  it  was  concluded  that  individual  requirements  of  each  user  need 
to  be  collected  and  considered  in  the  process  of  route  generation.  Extraction  of 
relevant  information  from  both  the  users’  context  as  well  as  the  environment  from 
VGI  and  crowdsourced  data  could  lead  to  a richer  real-time  routing  and  naviga- 
tion service  that  could  benefit  from  the  availability  of  temporally  detailed  road 
network  data  with  conditions  such  as  speed  or  other  detailed  semantic  attributes. 
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Abstract 

Rapid  technological  development  and  the  introduction  of  smart  services  make 
it  possible  for  modern  cities  to  offer  an  enhanced  perception  of  city  life  for  their 
inhabitants.  For  instance,  a smart  timetable  service  of  the  city’s  public  trans- 
portation lines  updated  in  real-time  can  decrease  unnecessary  waiting  times  at 
stops  and  increase  the  efficiency  of  travel  planning.  However,  the  implementa- 
tion of  such  a service  in  a traditional  way  requires  the  deployment  and  mainte- 
nance of  some  costly  sensing  and  tracking  infrastructure.  Fortunately,  for  this 
purpose  mobile  crowdsensing  can  be  a viable  and  almost  free  of  charge  alterna- 
tive. In  this  case,  the  crowd  of  passengers  and  their  mobile  devices  are  used  to 
gather  data. 

In  this  chapter,  we  place  emphasis  on  the  introduction  of  a crowdsensing 
based  smart  timetable  service,  which  has  been  developed  as  a prototype  smart 
city  application.  The  front-end  interface  of  this  service  is  called  Trafficlnfo.  It  is 
a simple  and  easy-to-use  Android  application  which  visualizes  public  transport 
information  of  the  given  city  on  Google  Maps  in  real-time.  The  live  updates 
of  transport  schedule  information  rely  on  the  automatic  stop  event  detection 
of  public  transport  vehicles.  Trafficlnfo  is  built  upon  an  Extensible  Messaging 
and  Presence  Protocol  (XMPP)  based  communication  framework  which  was 
designed  to  facilitate  the  development  of  crowd  assisted  smart  city  applications. 
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The  chapter  introduces  this  generic  framework  shortly,  then  describes  the  pro- 
totype smart  timetable  service. 
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Introduction 

More  and  more  modern  cities  offer  smart  services,  which  are  services  using 
modern  infrastructure  and/or  providing  value  added  functions  to  ease  the 
everyday  life  of  inhabitants.  Unfortunately,  the  traditional  way  of  introducing 
a new  service  usually  implies  a huge  investment  to  deploy  and  maintain  the 
necessary  background  infrastructure.  One  of  the  most  popular  city  services  is 
public  transportation.  Maintaining  and  continuously  improving  such  a service 
are  imperative  in  modern  cities.  However,  the  implementation  of  even  a simple 
feature  that  extends  the  basic  service  functions  can  be  expensive.  For  instance, 
let  us  consider  the  replacement  of  static  timetables  with  a live  public  trans- 
port information  service  updated  in  real-time.  It  requires  the  deployment  of  a 
vehicle -tracking  infrastructure  consisting  of  among  others  GPS  sensors,  com- 
munication infrastructure,  back-end  systems  and  front-end  user  interfaces, 
which  can  be  a cost  intensive  investment. 

An  alternative  approach  to  collect  real-time  tracking  data  is  exploiting  the 
power  of  the  crowd  via  participatory  sensing  or  often  called  mobile  crowdsens- 
ing, which  does  not  call  for  such  an  investment.  In  this  scenario  (see  Figure  1), 
the  passengers  mobile  devices  and  their  built-in  sensors,  or  the  passengers  them- 
selves via  reporting  incidents,  are  used  to  generate  the  monitoring  data  for  vehi- 
cle tracking.  Moreover,  they  send  instant  route  information  to  the  service  pro- 
vider in  real-time.  The  service  provider  then  aggregates,  cleans,  analyzes  the  data 
gathered,  and  derives  and  disseminates  the  real-time  updates.  The  sensing  task 
is  carried  out  by  the  built-in  and  ubiquitous  sensors  of  the  smartphones  either  in 
participatory  or  opportunistic  way  depending  on  whether  the  user  is  involved  or 
not  in  data  collection.  Every  traveler  can  contribute  to  this  data-harvesting  task. 
Thus,  passengers  waiting  for  a ride  at  the  stop  can  report  the  line  number  with  a 
timestamp  of  every  arriving  public  transport  vehicle  during  the  waiting  period. 
On  the  other  hand,  onboard  passengers  can  be  used  to  gather  and  report  actual 
position  information  of  the  moving  vehicle  and  detect  halt  events  at  the  stops. 

In  this  chapter,  we  focus  on  the  introduction  of  a crowdsensing  based  smart 
timetable  service,  which  has  been  developed  as  a prototype  smart  city  applica- 
tion. The  front-end  interface  of  this  service,  called  Trafficlnfo,  is  a simple  and 
easy-to-use  Android  application.  It  visualizes  live  public  transport  informa- 
tion of  the  given  city  on  Google  Maps.  Trafficlnfo  is  built  upon  an  Extensible 
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GTFS  Database 

Figure  1:  Live  public  transport  information  service  based  on  mobile 
crowdsensing. 


Messaging  and  Presence  Protocol  (XMPP)  (Saint- Andre  2011)  based  commu- 
nication framework  (Szabo  & Farkas  2013).  This  framework  was  designed  to 
facilitate  the  development  of  crowd  assisted  smart  city  applications  (and  will 
be  introduced  shortly  in  the  upcoming  section).  Following  the  publish/sub- 
scribe  (pub/sub)  communication  model  the  passengers  subscribe  in  Trafficlnfo 
to  traffic  information  channels  according  to  their  interest.  These  channels  are 
dedicated  to  different  public  transport  lines  or  stops.  Hence,  the  passengers  are 
informed  about  the  live  public  transport  situation.  For  instance,  they  can  see 
the  actual  vehicle  positions,  deviation  from  the  static  timetable,  crowdedness 
information,  travel  conditions,  etc. 

To  motivate  user  participation  in  data  collection  an  initial  service  is  offered 
to  the  passengers,  which  is  a static  public  transportation  timetable.  It  is  built  on 
the  General  Transit  Feed  Specification  (GTFS)  (Google  Inc.  2006)  based  transit 
schedule  data  and  provided  by  public  transport  operators.  GTFS  is  the  best 
practice  for  providing  such  information,  and  is  available  in  350  cities  attracting 
more  than  6.5  million  users.  According  to  the  GTFS  developer  page,  currently 
GTFS  data  is  available  for  879  transit  agencies  worldwide.  Trafficlnfo  basically 
presents  this  static  timetable  information  to  the  users  which  is  then  updated  in 
real-time,  if  appropriate  crowdsensed  data  is  available.  To  this  end,  the  appli- 
cation collects  the  following  information:  position  data;  the  timestamped  halt 
events,  detected  automatically,  of  the  public  transport  vehicles  at  the  stops;  sim- 
ple annotation  data  entered  by  the  user,  such  as  reports  on  crowdedness  and 
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travel  conditions.  After  analyzing  the  data  gathered  live  updates  are  generated 
and  Trafficlnfo  refreshes  the  static  information  with  these  updates. 

The  rest  of  this  chapter  is  structured  as  follows.  A quick  overview  of  crowd 
assisted  transit-tracking  systems  is  provided  in  the  next  section.  Then,  our 
generic  framework  to  facilitate  the  development  of  crowdsensing  based  ser- 
vices is  introduced  shortly.  Next,  we  describe  the  prototype  smart  timetable 
service.  Finally,  we  give  a short  summary. 

Crowd  Assisted  Transit- tracking  Systems  and  Approaches 

This  section  gives  an  overview  of  crowd  assisted  transit-tracking  solutions. 

Moovit  (Moovit  Developers  2014)  is  meant  to  be  a live  transit  app  on  the 
market  providing  real-time  information  about  public  transportation.  Moovit 
has  been  successful  only  in  those  cities  where  it  has  already  a mass  of  users,  just 
like  in  Paris,  and  not  successful  in  cities  where  its  user  base  is  low,  e.g.  in  Buda- 
pest. In  order  to  create  a sufficiently  large  user  base  Moovit  provides,  besides 
live  data,  schedule  based  public  transportation  information  as  an  initial  service, 
too.  The  source  of  this  information  is  the  company  who  operates  the  public 
transportation  network.  Moovit  partially  relies  on  GTFS. 

Several  other  mobile  crowdsensing  based  transit-tracking  ideas  have  been 
published  recently.  For  instance,  Zhou,  Zheng  and  Li  (2012)  propose  a bus 
arrival  time  prediction  system  based  on  bus  passengers  participatory  sensing. 
The  proposed  system  uses  movement  statuses,  audio  recordings  and  mobile  cell 
tower  signals  to  identify  the  vehicle  and  its  actual  position.  Thiagarajan  et  al. 
(2010)  propose  a method  for  transit  tracking  using  the  collected  data  of  the 
accelerometer  and  the  GPS  sensor  on  the  users  smartphone.  Bedogni,  Di  Felice 
and  Bononi  (2012)  use  smartphone  sensors  data  and  machine  learning  tech- 
niques to  detect  motion  type,  e.g.  traveling  by  train  or  by  car.  EasyTracker  (Bia- 
gioni  et  al.  2011)  provides  a low  cost  solution  for  automatic  real-time  transit 
tracking  and  mapping  based  on  GPS  sensor  data  gathered  from  mobile  phones, 
which  are  placed  in  transit  vehicles.  It  offers  arrival  time  prediction,  as  well. 

These  approaches  focus  on  the  data  to  offer  enriched  services  to  the  users. 
The  focus  of  our  work,  in  turn,  is  on  how  to  introduce  such  enriched  services 
incrementally.  Namely,  how  one  can  create  an  architecture  and  service  model, 
which  allows  incremental  introduction  of  live  updates  from  participatory  users 
over  static  services  that  are  available  in  competing  approaches.  Hence,  our 
work,  in  essence,  complements  the  above  ones. 

Generic  Framework  for  Crowdsensing  Based  Smart  City 
Applications 

In  this  section,  our  generic  framework  (Szabo  & Farkas  2013)  to  aid  the  devel- 
opment of  crowdsensing  based  smart  city  applications  is  described  shortly.  This 
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framework  is  based  on  the  XMPP  publish/subscribe  architecture.  Trafficlnfo  is 
implemented  on  top  of  this  framework. 

Communication  Model 

XMPP  (Saint- Andre  2011)  is  an  open  technology  for  real-time  communica- 
tion using  Extensible  Markup  Language  (XML)  message  format.  XMPP  allows 
sending  of  small  information  pieces  from  one  entity  to  another  in  quasi  real- 
time. It  has  several  extensions,  like  multi-party  messaging  or  the  notification 
service.  The  latter  realizes  a publish/subscribe  (pub/sub)  communication 
model,  where  publications  sent  to  a node  are  automatically  multicast  to  the 
subscribers  of  that  node.  This  pub/sub  communication  scheme  fits  well  with 
most  of  the  mobile  crowdsensing  based  applications.  In  these  applications,  the 
users  mobile  devices  are  used  to  collect  data  about  the  environment  (publish) 
and  the  users  consume  the  services  updated  on  the  basis  of  the  collected  data 
(subscribe). 

Hence,  we  use  XMPP  and  its  publish/sub  scribe  communication  model  in  our 
generic  framework  to  implement  interactions.  In  this  model,  we  defined  three 
roles,  like  Producer , Consumer  and  Service  Provider  (see  Figure  2).  These  enti- 
ties interact  with  each  other  via  the  core  service,  which  consists  of  event  based 
pub/sub  nodes. 

Producer:  The  Producer  acts  as  the  original  information  source  in  the  model 
producing  raw  data  streams  and  plays  a central  role  in  data  collection.  He  is  the 
user  who  contributes  his  mobile  s sensor  data. 

Consumer:  The  Consumer  is  the  beneficiary  of  the  provided  service(s).  He 
enjoys  the  value  of  the  collected,  cleaned,  analyzed,  extended  and  disseminated 
information.  The  user  is  called  as  Prosumer , when  he  acts  in  the  service  as  both 
Consumer  and  Producer  at  the  same  time. 


j Producer j 

/ 1 publisfi 

j Producer j- 


^ConsumerJ 

-fr^Consumer^ 
^Consumer^ 

raw  information  nodes  value  added  information  nodes 


Figure  2:  Crowdsensing  model  based  on  publish/subscribe  communication. 
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Service  Provider:  The  Service  Provider  introduces  added  value  to  the  raw 
data  collected  by  the  crowd.  Thus,  he  intercepts  and  extends  the  information 
flow  between  Producers  and  Consumers.  A Service  Provider  can  play  several 
roles  at  the  same  time,  as  he  collects  (Consumer  role),  stores  and  analyzes  Pro- 
ducers’ data  to  offer  (Service  Provider  role)  value  added  service.  Moreover, 
multiple  Service  Providers  can  act  concurrently  and  offer  different  value  added 
services  to  different  Consumers. 

In  the  model,  depicted  in  Figure  2,  Producers  are  the  source  of  original  data 
by  sensing  and  monitoring  their  environment.  They  publish  (marked  by  arrows 
with  empty  arrowhead)  the  collected  information  to  event  nodes  (raw  infor- 
mation nodes  are  marked  by  blue  dots).  On  the  other  hand,  Service  Provid- 
ers intercept  the  collected  data  by  subscribing  (marked  by  arrows  with  black 
arrowhead)  to  raw  event  nodes  and  receiving  information  in  an  asynchronous 
manner.  They  extend  the  crowdsensed  data  with  their  own  information  or 
extract  cleaned-up  information  from  the  raw  data  to  introduce  added  value 
to  Consumers.  Moreover,  they  publish  their  service  to  different  content  nodes. 
Consumers  who  are  interested  in  the  reception  of  the  added  value/service  just 
subscribe  to  the  appropriate  content  node(s)  and  collect  the  published  infor- 
mation also  in  an  asynchronous  manner. 


Framework  Architecture 

This  model  can  be  directly  mapped  to  the  XMPP  publish/subscribe  model  as 
follows  (see  Figure  3): 


Figure  3:  Mobile  crowdsensing  - the  publish/subscribe  value  chain  using  XMPP. 
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Service  Providers  establish  raw  pub/sub  data  nodes,  which  gather  Producers’ 
data,  for  the  services  they  offer. 

• Consumers  can  freely  publish  their  collected  data  to  the  corresponding 
nodes  with  appropriate  node  access  rights,  too.  However,  only  the  owner  or 
other  affiliated  Consumers  can  retrieve  this  information. 

• Producers  can  publish  the  collected  data  or  their  annotations  to  the  raw 
data  nodes  at  the  XMPP  server  only  if  they  have  appropriate  access  rights. 

• Service  Providers  collect  the  published  data  and  introduce  such  a service 
structure  for  their  added  value  via  the  pub/sub  subscription  service,  which 
makes  appropriate  content  filtering  possible  for  their  Consumers. 

• Prosumers  publish  their  sensor  readings  or  annotations  into  and  retrieve 
events  from  XMPP  pub/sub  nodes. 

• Service  Providers  subscribed  to  raw  pub/sub  nodes  collect,  store,  clean  up 
and  analyze  data  and  extract/derive  new  information  introducing  added 
value.  This  new  information  is  published  into  pub/sub  nodes  on  the  other 
side  following  a suitable  structure. 

The  pub/sub  service  node  structure  can  benefit  from  the  aggregation  feature  of 
XMPP  via  using  collection  nodes,  where  a collection  node  will  see  all  the  infor- 
mation received  by  its  child  nodes.  Note,  however,  that  the  aggregation  mecha- 
nism of  an  XMPP  collection  node  is  not  appropriate  to  filter  events.  Hence,  the 
Service  Provider  role  has  to  be  applied  to  implement  scalable  content  aggrega- 
tion. Figure  3 shows  XMPP  aggregations  as  dark  circles  at  the  container  node 
while  empty  circles  with  dashed  lines  represent  only  logical  containment  where 
intelligent  aggregation  is  implemented  through  the  service  logic. 

Smart  Timetable  Service 

In  this  section,  the  architecture  of  the  prototype  smart  timetable  service  is 
delineated  first,  then  Trafficlnfo,  its  front-end  Android  interface  together  with 
the  developed  automatic  stop  event  detector  is  described. 

Service  Architecture 

The  prototype  smart  timetable  service  architecture  has  two  main  building 
blocks,  such  as  the  generic  crowdsensing  framework  described  in  the  previ- 
ous section  and  the  front-end  application  called  Trafficlnfo  (see  Figure  4).  The 
framework  can  be  divided  into  two  parts,  a standard  XMPP  server  and  a GTFS 
Emulator  with  an  Analytics  module. 

XMPP  Server 

The  XMPP  server  maps  the  public  transport  lines,  stored  in  GTFS  (Google 
Inc.  2006)  format,  to  a hierarchical  pub/sub  channel  structure.  Thus,  the  GTFS 
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Figure  4:  Smart  timetable  service  architecture. 
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database  is  turned  into  an  XMPP  pub/sub  node  hierarchy.  This  node  structure 
facilitates  searching  and  selecting  transit  feeds  according  to  user  interest.  The 
pub/sub  node  model  for  content  filtering  in  a transport  information  feed  is 
depicted  in  Figure  5. 

The  root  of  the  pub/sub  tree  is  the  Agency  node  referring  to  the  public  trans- 
port operator.  Transit  information  and  real-time  event  updates  are  handled  in 
the  Trip  nodes  at  the  leaf  level.  The  inner  nodes  in  the  node  hierarchy  contain 
only  persistent  data  and  references  relevant  to  the  trips.  The  users  can  access 
the  transit  data  via  two  ways,  based  on  Routes  or  Stops.  When  the  user  wants 
to  see  a given  trip  (vehicle)  related  traffic  information  the  route  based  filtering 
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is  applied.  On  the  other  hand,  when  the  forthcoming  arrivals  at  a given  stop 
(location)  are  of  interest,  the  stop  based  filtering  is  the  appropriate  access  way. 

For  instance,  the  leaf  node  with  trip  ID  T>KK-Routes-3040-Inbound-A87757’ 
(cf.  the  bracketed  labels  in  the  nodes  of  Figure  5)  handles  transit  feed  and  its 
real-time  updates.  It  is  related  to  Trip  2 in  the  inbound  direction  and  belongs  to 
Route  A of  Agency  BKK  (operator  at  Budapest,  Hungary).  On  the  other  hand, 
node  ‘BKK-Routes-3040’  stores  persistent  transit  information  with  regard  to 
Route  A (e.g.  route  name,  short  name,  stops,  head-signs).  References  to  all  the 
currently  active  inbound  trips  are  found  in  node  CBKK-Routes-3040-Inbound’. 
Similarly,  node  CBKK-Stops-F0108’  stores  persistent  data  with  regard  to  the 
given  stop  (e.g.  stop  name,  GPS  coordinates)  and  lists  the  routes  this  stop  is 
part  of.  Furthermore,  the  trip  ID  of  every  active  trip  is  listed  in  the  route  node. 

GTFS  Emulator,  Analytics  Module 

The  GTFS  Emulator  provides  the  static  timetable  information,  if  it  is  available, 
as  the  initial  service.  It  basically  uses  the  officially  distributed  GTFS  database 
of  the  public  transport  operator  of  the  given  city.  However,  it  also  relies  on 
another  data  source,  which  is  OpenStreetMap  (OSM)  (Haklay  & Weber  2008), 
a crowdsourcing  based  mapping  service.  In  OSM  maps,  users  have  the  possibil- 
ity to  define  terminals,  public  transportation  stops  or  even  public  transporta- 
tion routes.  Thus,  the  OSM  based  information  is  used  to  extend  and  clean  the 
information  coming  from  the  GTFS  source.  The  resulted  data  set  reflects  more 
accurately  the  actual  situation  in  the  given  territory  because  the  OSM  data  is 
updated  more  frequently  than  the  GTFS  data  set. 

The  Analytics  module  is  in  charge  of  the  business  logic  offered  by  the  service, 
e.g.  deriving  crowdedness  information  or  estimating  the  time  of  arrivals  at  the 
stops  from  the  data  collected  by  the  crowd. 

Front-end  Application 

The  front-end  application,  called  Trafhclnfo,  handles  the  subscription  to  the 
pub/sub  channels,  collects  sensor  readings,  publishes  events  to  and  receives 
updates  from  the  XMPP  server,  and  visualizes  the  received  information. 

Trafficlnfo 

Trafhclnfo  has  four  main  functions,  such  as  visualization,  information  sharing, 
sensing  and  stop  event  detection.  These  functions  are  discussed  below. 

Visualization 

Most  of  the  users  benefit  from  the  visualization  capability  of  Trafhclnfo  that 
visualizes  public  transport  vehicle  movements  on  a city  map.  An  example  of 
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this  primary  function  can  be  seen  on  Figure  6a  displaying  trams  of  line  1,  4, 
6 and  buses  of  line  7 and  86  on  the  Budapest  map  in  Hungary  The  depicted 
vehicles  can  be  filtered  to  given  routes.  The  icon  of  a vehicle  reflects  various 
attributes,  such  as  the  number,  progress  or  crowdedness  of  the  specific  vehicle. 
Clicking  on  a vehicles  icon  a popup  shows  all  known  information  about  that 
specific  vehicle. 


Information  Sharing 

The  second  function  serves  for  information  sharing.  Passengers  can  share  their 
observations  regarding  the  vehicles  they  are  currently  riding.  Figure  6b  shows 
the  feedback  screen  that  is  used  to  submit  reports.  The  feedback  information 
is  spread  out  using  the  framework  and  displayed  on  the  devices  of  other  pas- 
sengers, who  might  be  interested  in  it.  It  is  up  to  the  user  what  information  and 
when  he  wants  to  submit. 


Sensing 

The  third  function  is  collecting  smartphone  sensor  readings  without  user  inter- 
action, which  is  almost  invisible  for  the  user.  It  is  done  automatically  in  the 
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background,  the  only  thing  the  user  has  to  do  is  to  start  Trafficlnfo.  User  positions 
are  reported  periodically  and  are  used  to  determine  the  vehicles  position  the  pas- 
senger is  actually  traveling  on.  In  order  to  create  the  link  between  the  passenger 
and  the  vehicle,  the  movement  of  the  user  is  identified  through  his  activities. 
To  this  end  various  sensors  are  used,  e.g.  accelerometer,  and  the  timestamped 
stop  events  of  the  vehicles  are  deducted.  The  duration  between  the  detected  stops 
coupled  with  GPS  coordinates  identifies  the  route  segment,  which  the  user  actu- 
ally rides.  With  regard  to  the  energy  consumption  of  the  sensor  readings  we  car- 
ried out  some  measurements.  Our  results  showed  that  in  case  of  a normal  daily 
scenario,  such  as  traveling  approx.  1 hour  to  the  workplace  in  the  morning  and 
1 hour  back  home  in  the  afternoon,  the  readings  and  local  processing  consume 
1.48  Wh  energy  on  average.  This  is  equivalent  roughly  to  20%  of  the  capacity  of 
an  average  smartphone  battery  (2,000  mAh,  3.7  V). 

Besides  the  GPS  coordinates  Google  also  provides  location  information  in 
those  areas,  where  there  is  no  GPS  signal  available.  Usually  this  position  is  highly 
inaccurate,  but  the  estimated  accuracy  is  also  provided.  Moreover,  the  activity 
sensor,  which  guesses  the  actual  activity  of  the  user,  is  also  used  by  Trafficlnfo. 
Currently,  the  supported  activities  are:  in  vehicle , on  bicycle , on  foot,  running , 
still,  tilting,  walking  and  unknown.  The  sensor  monitoring  part  of  Trafficlnfo  is 
active  only  if  the  activity  recognition  reports  in  vehicle  status.  Otherwise,  sensor 
monitoring  is  suspended.  Note,  that  the  accuracy  of  activity  recognition  can 
be  varying.  However,  in  our  experiments  the  in  vehicle  activity  was  recognized 
with  more  than  85%  accuracy,  which  made  this  tool  useful  for  our  purposes. 

The  collected  sensor  readings  on  one  hand  are  uploaded  to  the  XMPP  server. 
There  the  Analytics  module  processes  and  shares  them  among  participants 
who  are  subscribers  of  the  relevant  information.  On  the  other  hand,  they  are 
used  locally.  For  instance,  based  on  the  timestamps  of  the  detected  stop  events 
the  server  side  analytics  estimate  the  upcoming  arrival  times  of  the  given  vehi- 
cle and  disseminate  live  timetable  updates  to  the  subscribers. 


Stop  Event  Detection 

The  fourth,  most  challenging  function  of  Trafficlnfo  is  to  detect  stop  events 
of  public  transport  vehicles  without  user  interaction.  Trafficlnfo  implements 
such  a detector  locally  on  the  mobile  device.  When  stop  events  are  detected 
a summary  of  information  is  transmitted  to  the  XMPP  server.  This  summary 
consists  of  the  location  and  timestamp  of  the  event  and  the  time  elapsed  since 
the  last  stop  event.  The  final  decision  is  made  by  the  server  based  on  the  peri- 
odic reports  from  the  passengers.  It  is  a majority  decision,  so  if  the  majority  of 
the  reports  indicate  a stop  event  within  a given  time  window  the  detection  is 
made  otherwise  not. 

The  stop  event  detection  mechanism  is  based  on  features.  Hence,  several  fea- 
tures were  generated  from  the  experimental  usage  logs  collected  during  the 
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measurements.  In  this  work,  we  investigated  our  detection  mechanism  only  on 
trams  and  left  buses/trains  as  part  of  future  work.  The  approximately  1GB  meas- 
urement data  were  collected  by  10  volunteers  during  the  1 month  measurement 
period  using  Samsung  Galaxy  S3  and  Nexus4  smartphones.  The  gathered  con- 
text data  included  among  others  GPS,  Wi-Fi,  cellular  network  and  acceleration 
sensor  readings.  For  classification  the  J48  decision  tree  implementation  of  the 
Weka  data-mining  tool  (Hall  et  al.  2009)  was  used.  With  the  combination  of 
the  defined  features  and  models  the  detector  can  detect  stop  events  automati- 
cally with  relatively  high  accuracy  (with  0.86  AUC  - Area  Under  the  Curve) 
within  13  seconds  after  the  arrival  at  the  station.  The  place  of  the  stop  event  is 
decided  by  investigating  the  GPS  position  and/or  the  Wi-Fi/cellular  network 
fingerprint  of  the  environment,  so  stops  at  stations  can  be  distinguished  from 
other  stops  with  high  probability. 


Summary 

In  this  chapter,  after  a short  literature  review  a generic,  XMPP  based  communi- 
cation framework  was  introduced  which  was  designed  to  facilitate  the  develop- 
ment of  crowd  assisted  smart  city  applications.  Then  a prototype  crowdsensing 
based  smart  timetable  service  was  presented.  Its  front-end  Android  applica- 
tion, called  Trafficlnfo,  together  with  an  automatic  stop  event  detector  was 
introduced  in  detail.  This  service  was  implemented  on  top  of  the  introduced 
generic  framework.  It  updates  static  public  transport  timetables  and  delivers 
the  updated  information  to  its  subscribers  in  real-time. 
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Abstract 

The  Smart  City  connects  citizens  in  novel  ways  by  leveraging  the  latest  advances 
in  information  and  communication  technologies  (ICT).  Smart  citizens  have 
various  ICT  solutions  at  their  disposal,  which  allow  them  to  optimize  their  day- 
to-day  activities  in  the  urban  environment  they  live  and/or  work.  The  integra- 
tion of  rich  sensing  capabilities  (e.g.  camera,  microphone,  GPS,  accelerometer, 
barometer)  in  todays  mobile  devices  allows  their  users  to  sense  their  urban  envi- 
ronment in  often  unforeseen  ways.  In  mobile  crowd-sensing  the  citizens  of  the 
Smart  City  collect,  share  and  jointly  use  services  based  on  the  sensed  data,  e.g. 
the  Waze  application  for  optimized  car-based  navigation,  the  Smart  Citizen  pro- 
ject for  collecting  meteorological  measurements.  This  paper  presents  the  current 
state-of-the-art  and  future  challenges  in  mobile  crowd-sensing  in  urban  environ- 
ments, by  focusing  on  sensing  in  the  following  focus  areas:  environment,  citizen 
collaboration,  urban  traffic  systems,  health/fitness  and  social  networking.  From 
each  of  these  areas  a set  of  representative  applications  (e.g.  Waze,  Foursquare, 
Ushahidi)  were  selected,  analyzed  and  compared  based  on  the  following  crite- 
ria: expected  social  and  economic  impact,  novelty  and  sophistication  of  system 
architecture,  sensing  methods  applied,  motivation  techniques  and  user  privacy. 
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Introduction 

The  rich  sensing  capabilities  integrated  into  modern  mobile  devices  allow  their 
users  to  use  them  for  novel,  often  unforeseen  activities.  The  list  of  sensors  inte- 
grated into  the  latest  flagship  mobile  devices  includes  basic  ones  like  the  micro- 
phone necessary  to  record  the  users  voice  and  the  touchscreen  necessary  for 
text  input,  through  the  also  visible,  one  or  more  cameras  used  to  record  images 
from  the  users’  surroundings  and/or  of  the  users  themselves  (i.e.  selfies),  as  well 
as  more  obscure  sensors,  like  the  accelerometer  for  sensing  acceleration,  gyro- 
scope for  orientation,  proximity  for  distance,  compass  for  spatial  bearing,  GPS 
for  geographic  location  and  barometer  for  atmospheric  pressure.  The  latest 
offerings  (e.g.  the  Samsung  Galaxy  6 Edge  in  early  2015)  might  offer  personal 
health  related  sensing  as  well  in  the  form  of  heart- rate  and  oxygen  saturation 
sensing.  Devices  might  identify  their  users  with  built-in  fingerprint  sensors,  or 
via  scanning  and  recognizing  their  fingerprints  via  their  touchscreens. 

The  microphone,  touchscreen  and  camera  form  a sufficient  subset  of  sensors 
for  the  majority  of  use  cases.  The  rest  of  the  sensors  might  be  used  by  mobile 
device  producers  to  develop  more  user  friendly  behavior,  e.g.  automatically 
detecting  the  tilt  of  the  mobile  device  with  the  gyroscope  in  order  to  rotate  the 
screen  accordingly. 

Mobile  crowd-sensing  (MCS)  is  a relatively  new  discipline,  in  which  the  users 
of  modern  smart  phones  use  the  rich  sensing  capabilities  of  their  devices  to  col- 
lect and  share  information  while  on  the  move,  as  well  as  to  form  micro -crowds 
around  a certain  crowd- sensing  activity  (Cardone  et  al.  2013).  Current  state 
and  future  MCS  challenges  were  discussed  in  Ganti,  Ye  and  Lei  (2011),  while 
Zambonelli  (2011)  dissects  a more  general  theme,  namely  urban  crowd-sourc- 
ing. Goodchild  (2007)  was  one  of  the  first  identifying  the  crowd  as  a possible 
sensing  ‘tool’.  The  efficiency  and  efficacy  of  mobile  crowd- sensing  is  discussed 
in  Ma,  Zhao  and  Yuan  (2014). 

The  Smart  City  is  the  future  city  which  leverages  the  latest  advances  in  infor- 
mation and  communication  technologies  (ICT)  in  order  to  optimize  its  opera- 
tions and  the  everyday  processes  in  which  the  smart  citizens  take  part.  Mobile 
crowd-sensing  is  one  ICT  tool  which  might  be  leveraged  in  Smart  Cities,  as  it 
reaches  the  smart  citizens  and  involve  them  in  the  optimization  of  the  city’s 
processes. 

Crowd-sensing  simulation  efforts  (Farkas  8c  Lendak  2015;  Lendak  8c  Farkas 
2015;  Tanas  8c  Herrera- Joancomart  2013)  aim  to  simulate  crowd  behavior,  fore- 
cast sensing  patterns  and  help  researchers  and  solution  developers  to  choose 
what  to  sense  as  well  as  to  identify  the  minimum  user  threshold  necessary  for 
an  application  to  collect  sufficiently  ‘big’  data,  which  the  algorithms  can  crunch 
in  order  to  produce  useful  information.  Trustworthiness  of  the  data  sensed  by 
the  crowd  is  also  relevant  and  analyzed  in  Tanas  8c  Herrera- Joancomart  (2015). 

Both  mobile  crowd-sensing  and  the  Smart  City  are  intriguing  novel  research 
and  development  domains,  with  numerous  magazine  and  journal  special  issues 
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devoted  to  their  analysis,  e.g.  the  August  2015,  June  2013  and  June  2011  issues 
of  the  IEEE  Communications  Magazine,  the  June  2013  issue  of  the  Journal  of 
Knowledge  Economy,  etc.  This  paper  builds  on  those  results  and  discusses  the 
latest  mobile  crowd-sensing  efforts  in  the  Smart  City  setting,  with  a special 
focus  on  the  following  areas:  environment,  citizen  collaboration,  urban  traffic 
systems,  health/fitness  and  social  networking.  From  each  of  these  areas  one  or 
two  representative  applications  were  selected  based  on  their  technical  sophisti- 
cation and  size  of  user  base,  as  an  easy  measure  of  success.  The  solutions  chosen 
were  analyzed  by  trying  to  answer  the  following  questions: 

• How  do  they  impact  society  at  large  (i.e.  societal  impact)? 

• What  is  their  expected  economic  impact? 

• What  is  their  system  architecture  like?  Is  it  sophisticated  and  does  it  contain 
novel  solutions? 

• Which  sensors  and  how  do  they  employ  towards  reaching  their  goals? 

• How  do  they  motivate  their  users  to  contribute  and  use  the  application? 

• How  do  they  address  the  sensitive  question  of  user  privacy? 

Apart  from  this  introduction,  the  paper  contains  five  sections  discussing 
mobile  crowd-sensing  based  applications  from  the  above  identified  five  focus 
groups.  Their  descriptions  are  followed  by  their  comparative  analysis  in  section 
seven. 


Urban  environment 

The  latest  offerings  in  the  smartphone  arena  come  equipped  with  a limited  set 
of  meteorological  sensors,  e.g.  barometer,  thermometer.  Apart  from  the  obvi- 
ous meteorological  sensors,  the  integrated  microphone  can  be  used  for  sensing 
noise  levels,  and  the  camera  for  recording  specific  meteorological  or  other  phe- 
nomena. Noise  level  sensing  can  be  automated,  while  using  the  camera  requires 
human  interaction.  In  general,  modern  mobile  devices  are  still  lacking  in  sens- 
ing capabilities  focused  on  collecting  information  about  our  (natural)  environ- 
ment. These  limitations  might  be  mitigated  by  purpose-built  sensing  hardware. 

The  Smart  Citizen  (SC)  project  is  a mobile  crowd-sensing  based  project 
whose  goal  is  to  build  a platform  for  collecting  environmental  measurements 
in  urban  settings.  Its  website1  is  pictured  in  Figure  1.  SC  is  an  open-source 
platform  consisting  of  a hardware  device  (the  Smart  Citizen  Kit),  an  applica- 
tion programming  interface  utilizing  RESTful  web  services,  a mobile  applica- 
tion, a website  and  a web  based  community  of  volunteers,  i.e.  the  crowd’.  The 
hardware  kit  is  equipped  with  sensors  which  measure  air  composition  (CO 
and  N02),  temperature,  light  intensity,  sound  levels,  and  humidity.  It  is  able 


1 Smart  Citizen  projects  website,  https://smartcitizen.me/ 
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Figure  1:  The  Smart  Citizen  website. 


to  stream  data  measured  by  the  sensors  over  Wi-Fi.  Power  to  the  device  can  be 
provided  by  a solar  panel  and/or  battery,  and  it  can  be  placed  on  balconies  or 
window  sills. 

SC  does  not  employ  gamification  or  other  motivation  mechanisms,  which 
might  increase  public  interest  towards  this  solution.  It  has  a detailed  privacy 
policy  available  via  the  web-based  interface.  Username  and  geographic  loca- 
tion can  be  grabbed  from  the  screen  - which  negatively  impact  smart  citizen 
privacy.  By  removing  the  username  from  the  web  interface,  the  privacy  level  of 
the  application  could  be  significantly  improved. 

Another  interesting  project  dealing  with  environmental  issues  in  and  near 
urban  environments  is  Danger  Maps2,  a crowdsourced,  web -based  environ- 
ment monitoring  solution  originating  from  China.  Its  primary  goal  is  to  collect 
and  share  the  locations  of  the  various  sources  of  pollution,  e.g.  garbage  dumps, 
toxic-waste  treatment  facilities,  oil  refineries  and  power  plants.  The  applica- 
tion is  popular  in  China  where  pollution  is  a serious  issue.  It  was  created  after 
its  founder  learned  that  the  Shanghai  apartment  he  bought  in  2007  was  near 
a landfill  - something  he  wasn’t  informed  of  when  negotiating  the  purchase. 
Originally,  the  old’  Danger  Maps  contained  official  data  and  maps  released  by 
the  Chinese  Environmental  Protection  Agency,  but  since  20 133  the  crowd  is 
allowed  to  create  detailed  custom  maps  themselves  via  a web  based  interface. 
Danger  Maps  relies  on  social  sensors  (i.e.  human  users)  as  the  reports  are  posted 
in  textual  format.  The  camera  might  be  used  as  well  for  taking  pictures  of  the 
pollution  sources.  Unfortunately,  the  website  is  available  only  in  Chinese  - or  at 
least  the  author  failed  to  find  the  link  to  an  English  language  version. 

Efforts  similar  to  the  Smart  Citizen  or  the  Danger  Maps  projects  have  signifi- 
cant societal  impact,  as  they  allow  the  crowd  to  collect  and  share  information 


2 Danger  Maps  official  website,  http://www.epmap.org/ngo 

3 Custer,  C.,  ‘Danger  Maps’  Invites  You  to  Map  Chinas  Polluted  Areas  via  New  Open-Platform 
Maps,  2013,  https://www.techinasia.com/danger-maps-invites-map-chinas-polluted-areas- 
openplatform-maps/ 
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about  both  major  sources  of  pollution  and  the  quality  of  the  air  we  breathe, 
which  might  not  have  been  mapped  otherwise.  Solutions  similar  to  Danger 
Maps  might  be  especially  interesting  in  the  developing  world,  where  laws  regu- 
late the  protection  of  our  immediate  environment  to  a limited  extent,  or  where 
modern  legislation  is  available,  but  not  enforced.  Hopefully  the  power  of  the 
crowd’  exercised  via  solutions  similar  to  the  projects  discussed  in  this  section, 
might  put  additional  pressure  on  both  legislative  and  administrative  bodies 
in  the  environmental  protection  domain  and  force  them  to  act  more  quickly 
and  decisively.  The  immediate  economic  impact  of  these  two  solutions  is  lim- 
ited at  the  moment,  but  might  rise  with  the  wider  adoption  of  crowd- sensing, 
i.e.  when  the  user  bases  of  these  projects  become  larger  and  the  societies  built 
around  them  gain  more  lobbying  power. 


Citizen  collaboration 

Crowd-sensing  applications  give  a powerful  tool  into  the  hands  of  human  soci- 
eties, e.g.  they  allow  citizens  to  reach  their  governments  about  non-essential 
issues  they  detect  within  their  communities,  like  issues  reported  in  the  streets 
with  FixMyStreet4  (see  Figure  2)  and  similar  solutions  (e.g.  SeeClickFix5  in 
the  USA).  These  applications  usually  consist  of  a mobile  application  which  is 
used  for  sensing  and  a website  which  displays  the  sensed  events  in  near  real- 
time (see  Figure  2).  FixMyStreet  (FMS)  is  a crowd-sensing  platform  which  can 
be  customized  for  any  urban  area.  In  FMS  the  issue  reports  are  linked  to  the 
reporter’s  email  address,  but  FMS  ensures  its  users  that  only  the  representatives 


4-  C :::  i www.fixmystreet  com/around 


* @ © ® 


Figure  2:  FixMyStreet  issue  reports  in  Manchester,  UK  (April  19th,  2015). 


4 FixMyStreet  issues  in  Manchester,  UK:  https://www.fixmystreet.com/around?pc=manchester 

5 SeeClickFix  website,  http://en.seeclickfix.com 
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of  the  council  who  will  address  the  report,  and  FMS’  administrative  staff  might 
be  allowed  to  see  the  users’  email  addresses.  The  camera  is  the  most  important 
physical  sensor  used  to  take  pictures  of  the  issues  the  users  are  reporting.  It 
does  not  contain  motivation  tools  which  would  possibly  allow  it  to  build  more 
effective  human  sensor  networks,  by  allowing  them  to  compete  and  take  part 
in  in -application  games. 

Crowd-sensing  based  citizen  collaboration  efforts  like  FixMyStreet  or  See- 
ClickFix  might  allow  smart  citizens  to  collaborate  on  issues  affecting  local 
groups  in  an  urban  environment.  The  loosely  coupled  social  networks  formed 
around  these  solutions  might  more  easily  obtain  the  attention  of  local  admin- 
istration and  coax  them  into  taking  corrective  action  in  the  areas  of  interest, 
e.g.  fix  a pothole,  or  clean  up  an  unplanned  garbage  dump.  These  solutions 
might  have  a measurable  economic  impact  as  well,  mainly  for  local  adminis- 
trations, which  might  find  out  about  issues  sooner,  fix  them  and  spend  less  on 
paid  inspectors  who  would  travel  around  the  urban  areas  and  look  for  potholes, 
garbage  and  similar. 

Crowd- sensing  might  allow  wider  collaboration  during  disaster  relief  and 
during  political  turmoil,  e.g.  the  Ushahidi6  (Swahili  for  ‘testimony’  or  ‘witness’) 
website  started  after  Kenya’s  disputed  presidential  elections  in  2007.  In  Usha- 
hidi, when  an  event  occurs  a volunteer  sends  a brief  report  from  a smartphone, 
via  the  Web  or  text  message,  and  the  software  annotates  it  with  time  and  loca- 
tion information.  Such  information  can  then  be  visualized  and  ‘mined’.  It  relies 
on  social  sensors  and  on  the  camera,  as  the  reports  are  written  by  people  in 
natural  language  and  might  be  accompanied  by  a photograph  or  a video.  The 
human  sensors  are  usually  motivated  by  our  built-in  altruism,  i.e.  our  urge  to 
help  others  or  contribute  towards  a greater  good.  Ushahidi  itself  does  not  con- 
tain a motivation  scheme  which  would  reward  its  users  for  sharing  informa- 
tion. User  privacy  is  quite  important  in  Ushahidi,  especially  in  countries  where 
people  might  get  into  trouble  for  sharing  negative  views  about  the  govern- 
ment or  other  bodies.  The  Ushahidi  privacy  policy  claims  that  only  aggregated 
and  non-personally  identifiable  information  is  shared  with  third  parties.  The 
reports  shared  via  the  Ushahidi  application  for  Android  devices  do  not  contain 
personally  identifiable  information. 

Ushahidi  allows  the  crowd  of  its  users  to  report  issues  affecting  large  groups 
or  societies,  e.g.  the  attempted  rigging  of  elections  in  Kenya7  or  the  2010  earth- 
quake in  Haiti8.  Therefore,  it  has  a quite  significant  societal  impact.  Its  immedi- 
ate economic  impact  is  limited,  or  at  least  it  is  very  hard  to  measure. 


6 Ushahidi  official  website,  http://www.ushahidi.com/ 

7 Beyond  Voting  on  the  Ushahid  official  website,  http://www.ushahidi.com/2015/05/21/beyond- 
voting-using-ushahidi-to-help-citizens-protect-their-elections/ 

8 Ushahidi  Haiti  Project  - Evaluation  Final  Report,  http://www.ushahidi.com/2011/04/19/ 
ushahidi-haiti-project-evaluation-final-report/ 
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Traffic 

Modern  vehicles  have  rich  sensing  and  computing  capabilities,  which  might 
be  used  to  sense  and  share  information,  e.g.  they  might  detect  when  a parking 
spot  is  taken  and  share  the  information  automatically.  Smartphones  might  also 
detect  and  share  certain  events  automatically,  e.g.  the  Google  Activity  Recog- 
nition library  can  discern  whether  the  device  holder  is  walking,  driving  a car 
or  running,  based  on  the  accelerometers  measurements.  As  it  will  be  shown 
below,  the  most  important  sensors  used  in  traffic  system  related  crowd-sensing 
applications  are  the  GPS  sensor  and  the  accelerometer. 

Waze9  is  arguably  the  most  successful  crowd- sensing  based  application  in  the 
traffic  systems  domain.  Its  primary  function  is  point-to-point  navigation,  but 
it  performs  this  function  with  a twist:  it  allows  drivers  and  other  participants 
(e.g.  co-driver)  to  share  roadside  events,  e.g.  road  works,  accidents,  police  pres- 
ence, traffic  jams.  These  event  reports  are  then  aggregated,  shown  on  the  map 
and  used  by  the  navigation  algorithm,  which  might  help  drivers  to  avoid  road- 
side events  leading  to  traffic  jams,  e.g.  a collision  during  rush  hour.  Waze  is  not 
limited  to  urban  environments,  i.e.  it  is  not  a strictly  a Smart  City  application. 

Waze  consists  of  a mobile  application  used  for  navigation  and  issue  report- 
ing, a big  data  storage,  a service  for  running  the  data  analysis  algorithms,  and  a 
web-based  live  map  showing  the  latest  events.  It  applies  an  intricate  motivation 
scheme,  which  includes  user  levels  (ranging  from  baby,  via  warrior  to  ‘king’) 
based  on  the  amount  of  points,  which  might  be  collected  through  long  hours  of 
active  use  and  issue  reports,  user  avatars,  in-app  messaging  and  occasional  in- 
app  games.  In  one  such  game  the  mobile  application  generated  Easter  eggs  in 
the  streets  near  the  driver  and  awarded  extra  points  for  collecting  them.  Waze 
uses  the  GPS  sensor  to  calculate  the  current  location,  the  camera  to  take  pic- 
tures of  the  events  and  the  accelerometer  to  automate  certain  event  detections, 
e.g.  stop-and-go  traffic.  Most  of  the  events  are  sensed  by  the  social  sensors, 
i.e.  the  users  manually  annotate  a roadside  event  by  clicking  on  its  icon  in  the 
mobile  application  or  choosing  its  type  from  a list. 

Waze  links  the  user’s  account  to  his/her  mobile  phone  (number)  and  shows 
an  avatar  onscreen  (both  on  the  mobile  and  in  the  Web  based  live  map)  at 
the  GPS  position  of  the  user.  Other  nearby  users  are  shown  onscreen,  not  just 
friends,  thereby  allowing  to  learn  their  whereabouts  based  on  grabbing  screen- 
shots  containing  their  avatars.  Additionally,  it  is  possible  to  report  non-existing 
roadside  events,  e.g.  traffic  jams  by  sending  in  well-formed  Waze  messages 
from  a custom-built  application.10  Such  message  fabrication  attacks  might  be 
used  to  cause  havoc  in  traffic  systems. 


9 Waze  website,  https://www.waze.com 

10  T.  Jeske,  “Floating  Car  Data  from  Smartphones:  What  Google  And  Waze  Know  About  You  and 
How  Hackers  Can  Control  Traffic”,  https://media.blackhat.com/eu-13/briefings/Jeske/bh-eu- 
13-floating-car-data-jeske-slides.pdf 


360  European  Handbook  of  Crowdsourced  Geographic  Information 


Apart  from  steering  drivers  clear  of  congestion,  crowd-sensing  might  come 
in  handy  in  solving  parking  problems  in  busy  urban  areas,  where  there  are  no 
funds  to  develop  an  advanced  infrastructure  of  parking  sensors  in  the  streets, 
as  done  in  San  Francisco  with  the  SFPark11  system,  or  in  the  City  of  Westmin- 
ster (London)  with  Smart  Parking.12  One  such,  crowd-sensing  based  solution, 
Googles  OpenSpot13  tried  to  use  the  power  of  the  crowd  to  sense  parking  related 
events,  and  based  on  that  data  provide  suggestions  to  drivers  who  were  looking 
for  parking.  It  was  cancelled  in  2012  due  to  its  limitations,  mainly  its  inability 
to  adapt  to  busy  urban  environments  where  a parking  spot  might  remain  unoc- 
cupied only  for  a couple  of  seconds,  as  well  as  for  the  lack  of  user  base,  i.e.  the 
size  of  its  user  base  was  insufficient  to  make  it  successful.  Anagog14  is  a promis- 
ing new  player  in  the  urban  parking  arena  (see  Figure  3).  It  uses  Waze’s  data  to 
automatically  sense  and  share  parking  events  - in  essence  it  learns  the  habits  of 
drivers,  i.e.  where  and  when  they  park  their  cars. 

The  analyzed  applications  address  two  pressing  matters  in  the  crowded  urban 
environments  of  the  21st  century,  namely  congestion  and  the  limited  availabil- 
ity of  parking.  Waze  helps  its  users  steer  clear  of  traffic  jams  by  utilizing  the 
reports  received  from  other  Wazers  (i.e.  Waze  users)  who  had  the  misfortune 


11  SFPark  website,  http://sfpark.org/how-it-works/ 

12  Smart  Parking  website,  http://www.smartparking.com/about-us 

13  OpenSpot  in  the  news,  http://www.androidauthority.com/google-labs-open-spot-a-useful- 
application-that-no-one-uses- 151 86/ 

14  Anagog  website,  http://anagog.com 
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of  getting  stuck  in  congestion.  The  parking  assistance  applications  aim  to  use 
crowd-sensed  big  data  in  order  to  suggest  the  most  likely  location  of  an  opti- 
mal parking  spot,  thereby  reducing  the  time  spent  in  cruising  for  parking’, 
consequentially  lowering  petrol  costs  and  time  wasted.  As  both  costs  and  time 
wasted  might  be  measured  in  money,  we  conclude  that  these  applications  have 
a significant  economic  impact,  especially  if  they  reach  the  threshold  number  of 
active  users  allowing  them  to  provide  useful  suggestions  to  the  crowd. 


Health  and  fitness 

The  power  of  the  masses  can  contribute  towards  sharing  information  among 
patients  suffering  from  specific  illnesses.  Apart  from  allowing  patients  to  link 
with  others  who  have  similar  health  problems,  the  data  collected  and  shared 
by  patients  might  be  used  for  health-care  optimization,  e.g.  cancer  survivors 
were  planned  to  be  brought  Together  in  one  such  solution.15  Patient  networking 
websites  like  PatientsLikeMe  (PLM)  allow  individuals  with  certain  health  con- 
ditions to  share  and  compare  their  symptoms  and  responses  to  the  treatments 
they  received.  PLM  relies  on  social  sensors  for  data  collection,  i.e.  people  them- 
selves describe  their  mood  and  physical  condition,  either  by  answering  ques- 
tions asked  by  the  application,  or  writing  textual  descriptions.  PLM  and  other 
similar  tools  might  allow  healthcare  professionals  to  create  more  precise  meas- 
urement and  assessment  tools  based  on  crowd- sensed/crowd- sourced  data,  or 
might  even  offer  early  warning  in  case  of  infectious  disease  outbreaks.  There- 
fore both  the  societal  and  economic  impact  of  these  solutions  are  significant 
as  they  can  improve  the  prospects  of  sick  people  via  mining  the  information 
shared  by  them  and  using  it  to  develop  better  medicines  and  procedures  on 
one  hand,  and  potentially  lowering  costs  on  the  other  hand,  by  allowing  people 
to  learn  more  about  their  condition  even  before  visiting  a physician,  and  being 
capable  of  better  describing  how  and  what  they  feel  based  on  the  information 
shared  by  others  with  similar  conditions. 

The  above  listed  crowd-enabled  medical  solutions  usually  have  simple  archi- 
tectures, e.g.  a website  where  the  users  might  share  information  about  their 
health.  The  users  are  the  sensors  themselves,  as  they  describe  in  text  how  they 
feel  and  what  symptoms  they  have.  PLM  applies  a simple  motivation  scheme 
in  which  it  polls  its  users  daily  in  order  to  remind  them  to  share  information 
about  how  they  feel.  It  also  awards  users  with  stars  for  sharing  information 
about  their  ailments.  Some  news  agencies  reported16  that  personal  health  infor- 
mation was  possible  to  be  collected  from  the  PLM  website  by  interested  third 


15  Together  by  Medstartr,  http://www.medstartr.com/projects/192-together 

16  CBS  News,  “PatiensLikeMe  is  more  villain  than  victim  in  patient  data  “scraping”  scandal”, 
http://www.cbsnews.com/news/patientslikeme-is-more-villain-than-victim-in-patient-data- 
scr  aping- scandal/ 
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Figure  4:  Sports  Tracker  fitness  app  showing  fitness  statistics  and  an  on-map 
route. 


parties,  i.e.  their  privacy  policy  and  its  enforcement  might  not  have  been  as 
strong  as  necessary 

Fitness  apps  (e.g.  Sports  Tracker17  - see  Figure  4)  allow  their  users  to  measure 
their  achievements  while  exercising,  either  automatically  by  reading  the  neces- 
sary information  from  the  built-in  sensors,  or  by  allowing  users  to  manually 
enter  their  results.  The  most  important  sensors  in  fitness  applications  are  the 
GPS,  accelerometer  and  lately  the  heart  rate  monitor  and  oxygen  saturation 
sensor.  Fitness  solutions  also  tend  to  have  relatively  simple  architectures,  con- 
sisting of  a mobile  app  and  a (cloud  based)  data  storage,  where  the  results  of 
exercises  are  stored.  Users  are  motivated  by  allowing  them  to  post  their  achieve- 
ments (e.g.  kilometers  ran)  on  social  networks  or  organizing  competitions  with 
other  users.  Sports  Tracker  also  has  an  elaborate  privacy  policy  clearly  outlin- 
ing how  the  system  uses  the  data  collected  and  shared.  Crowd-sensing  based 
fitness  applications  can  boost  peoples  enthusiasm  towards  physical  exercise 


17  Sports  Tracker,  http://www.sports-tracker.com 
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and  thereby  improve  public  health  and  lower  the  amount  of  funds  spent  on 
healthcare,  i.e.  they  have  a measurable  societal  and  economic  impact. 


Social  networks 


Although  Facebook  posts  and  Twitter  Tweets  might  contain  descriptive 
information  about  our  environment,  events  in  the  traffic  system  and  even 
about  our  health  and/or  fitness,  their  primary  aim  is  not  mobile  crowd-sensing. 
As  opposed  to  the  above  named  leaders  in  the  social  networking  area,  Four- 
square (FS)  (see  Figure  5),  the  local  search,  discovery  and  recommenda- 
tion app  for  mobiles  has  both  social  networking  and  sensing  elements.  FS 
takes  into  consideration  where  its  users  go  and  what  they  tell  the  application 
about  those  places,  and  advises  other  users  where  to  go  and  what  to  visit  near 
their  current  location,  e.g.  it  might  allow  a user  in  a foreign  country  to  find 
points  of  interest  around  him/her  with  only  a smartphone  and  an  internet 
connection. 

The  users  sense  information  about  the  points  of  interest  (POIs)  near  their 
location  by  answering  questions  asked  by  the  application.  The  social  network- 
ing features  contained  in  earlier  versions  (e.g.  check-in  at  a location  and  shar- 
ing the  event  with  friends  also  using  FS)  were  factored  out  into  a separate  appli- 
cation named  Swarm.  This  contributes  towards  privacy,  as  check-ins  are  not 
necessarily  visible  to  followers.  Personally  identifiable  information  is  visible  in 
the  application,  as  the  users’  full  names  and  home  town  can  be  accessed  via 
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Figure  5:  Foursquare  location  based  recommendations. 
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the  ratings  and  tips  they  leave.  Foursquare,  and  especially  its  earlier  versions 
(prior  to  Swarm)  employed  an  elaborate  motivation  scheme,  consisting  of  a 
point  system  in  which  points  were  scored  for  each  check-in,  badges  achieved 
via  checking  in  at  certain  locations,  and  the  users  could  become  ‘Mayor’  of  a 
certain  venue  by  checking-in  more  often  than  any  other  user. 

The  architecture  of  Foursquare  became  considerably  more  elaborate  with  the 
introduction  of  Swarm,  as  now  the  solution  consists  of  a mobile  application,  a 
backend  service  used  for  storing  the  big  data  collected  from  contributors  and 
running  the  recommendation  algorithms,  as  well  as  Swarm,  which  cooperates 
with  the  mobile  application  and  adds  social  networking  features. 

Foursquare  has  a measurable  economic  impact,  as  it  can  steer  visitors  towards 
POIs  (e.g.  restaurant,  bar)  which  have  high  ratings  thereby  increasing  their 
incomes.  Its  societal  impact  is  not  so  clear,  but  it  can  surely  help  its  users  when 
they  are  in  an  unknown  and  new  environment,  e.g.  visiting  a city  in  a foreign 
country  without  a local  guide,  by  steering  them  towards  interesting  places  and 
making  them  feel  less  out-of-place  and  better  connected  to  the  location  they 
are  visiting. 

It  is  important  to  note  that  the  users  of  other  crowd-sensing  based  applica- 
tions (e.g.  Waze,  FixMyStreet)  might  also  be  regarded  as  members  of  dynamic 
social  networks,  formed  around  certain  events  or  activities  (e.g.  drivers  in  and 
around  a single  urban  area). 


Comparative  analysis 

The  applications  and  solutions  described  in  this  paper  were  analyzed  based  on 
the  following  criteria:  social  visibility  and  impact,  expected  economic  impact, 
sophistication  of  their  system  architectures,  sensing  method(s),  motivation 
scheme  and  privacy  level.  For  their  comparative  analysis  presented  in  this  sec- 
tion, from  each  group  one  or  two  applications  were  selected  and  marked  in 
each  of  the  above  listed  areas.  The  ratings  were  made  based  on  publicly  avail- 
able material  and/or  the  authors  own  experience  in  using  the  solutions.  A cou- 
ple of  solutions  were  intentionally  omitted  from  this  comparative  because  of 
various  reasons,  e.g.  Danger  Maps  could  not  be  analyzed  as  most  of  the  availa- 
ble information  is  in  Chinese,  and  there  was  no  new  data  available  online  about 
the  Together  project. 

Societal  impact  was  measured  as  a combination  of  the  number  of  active 
crowd-sensors  and  the  (expected)  level  of  contribution  of  the  selected  applica- 
tion towards  a greater  good,  e.g.  improving  the  lives  of  many.  The  number  of 
active  crowd- sensors  is  not  easily  measured,  as  a varying  percentage  of  the  user 
base  of  these  solutions  is  passive,  i.e.  they  do  not  sense,  just  consume  the  ser- 
vices based  on  the  sensations  of  others.  Waze  and  Foursquare  stand  out  based 
on  the  total  number  of  their  (millions  of)  users.  Although  contribution  towards 
a greater  good  is  even  more  challenging  to  measure,  Ushahidi,  FixMyStreet 
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and  PatientsLikeMe  stand  out  as  they  might  help  people  during  disasters  or 
other  socially  disruptive  events,  help  local  communities  address  their  issues 
more  effectively,  as  well  as  help  inform  patients  and  collect  statistical  informa- 
tion about  their  conditions,  potentially  leading  to  novel  medical  findings  and 
procedures.  Waze  and  Foursquare  have  global  reach,  but  their  mission  is  not  as 
noble’  as  those  of  the  above  listed  three  applications.  As  none  of  the  analyzed 
solutions  have  both  a clear  mission  towards  achieving  a greater  good,  as  well  as 
millions  of  users  globally,  the  author  felt  that  they  all  have  a mid-range  impact. 

Economic  impact  is  a subjective  area  which  is  similarly  challenging  to  meas- 
ure as  societal  impact.  The  author  made  an  attempt  to  compare  the  applications 
by  taking  into  consideration  the  expected  economic  impact  they  should’  make 
based  on  their  primary  goals.  Waze  and  Anagog  stand  out  as  they  might  opti- 
mize urban  traffic,  via  avoiding  traffic  jams  and/or  steering  drivers  more  quickly 
to  empty  parking  spots.  The  mid-range  players  are  FixMyStreet,  PatientsLikeMe 
and  Foursquare.  FMS  allows  its  users  to  report  issues  in  their  neighborhoods  to 
their  councils,  which  might  not  know  about  those  issues  otherwise,  or  would 
need  to  employ  people  to  manually  identify  them.  PLM  allows  its  users  with 
various  illnesses  to  learn  about  their  condition  from  other  people  having  simi- 
lar conditions,  which  in  turn  might  allow  early  diagnosis  even  before  visiting  a 
doctor,  thereby  lowering  healthcare  expenditure.  FS  might  allow  highly  rated 
businesses  to  attract  tourists  and  other  visitors  and  thereby  raise  their  revenue. 

The  novelty  and  complexity  of  system  architecture  was  measured  based  on 
counting  the  total  number  of  the  following  components  identified  during  the 
analysis  of  the  crowd-sensing  applications:  purpose-built  sensing  hardware,  a 
website  with  live  data,  a mobile  (sensing)  application,  a big  data  store,  advanced 
algorithms  used  for  analyzing  the  data  collected  and  generating  additional 
value  and  information,  as  well  as  social  networking  features.  Apart  from  the 
count  of  various  elements,  another  important  measure  was  the  perceived  level 
of  seamlessness  of  their  integration  into  a user  oriented,  integrated  product. 
Foursquare  (with  Swarm)  and  Waze  are  similar  in  not  building  custom  sensing 
hardware,  but  having  all  the  other  components  and  being  quite  user-friendly 
and  seamlessly  integrated.  Foursquare  boasts  an  elaborate  mix  of  information 
sensing  and  sharing  mobile  application  combined  with  a social  networking 
component  (Swarm)  and  a powerful  backend  aggregating  the  recommenda- 
tions. Waze  also  consists  of  a mobile  application,  a Web-based  live  map  and 
backend  for  crunching  the  incoming  data,  creating  (alternative)  routes  and 
generating  various  interesting  in-app  games  with  rewards  to  the  users  who  par- 
ticipate. The  Smart  Citizen  project  on  the  other  hand  has  purpose  built  hard- 
ware, but  is  lacking  in  the  area  of  generating  additional  information  and  social 
networking  features.  The  rest  of  the  projects  had  slightly  less  complex  system 
architectures,  as  they  were  either  mostly  web  based  (e.g.  PLM),  or  seemed  to  be 
more  focused  on  their  presence  on  mobile  devices  (e.g.  Sports  Tracker). 

The  sensing  capabilities  of  the  applications  were  assessed  based  on  the 
number  of  hardware  sensors  used  by  them,  the  sophistication  of  social  sensor 
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utilization  (i.e.  how  well  they  handle  the  data  manually  entered  by  their  users), 
as  well  as  the  existence  of  advanced  signal  processing  algorithms  applied.  Waze 
and  Sports  Tracker  excel  in  all  three  areas.  Waze  utilizes  the  hardware  sensing 
features  of  mobile  devices  (e.g.  GPS,  accelerometer),  allows  its  users  to  manu- 
ally enter  useful  data  and  automates  sensing  via  its  signal  processing  features, 
e.g.  the  automatic  detection  of  stop-and-go  traffic.  Sports  Tracker  also  auto- 
mates the  collection  of  track  length  ran  or  cycled,  and  calculates  the  length  and 
cost’  (e.g.  in  calories)  of  an  exercise.  It  also  relies  on  the  hardware  sensing  capa- 
bilities of  mobile  devices  (e.g.  GPS)  and  allows  users  to  manually  enter  data 
about  exercises  whose  detection  it  does  not  automate  (e.g.  weight  lifting).  The 
Smart  Citizen  project  also  stands  out,  as  it  is  the  only  solution,  which  developed 
a custom-built  sensing  module  in  order  to  achieve  its  smart  city’  objective, 
namely  to  measure  the  various  characteristics  of  our  environment. 

Crowd- sensing  based  solutions  cannot  succeed  without  the  crowd-sensors, 
i.e.  without  their  users.  In  today’s  fast-paced  life  it  is  hard  to  learn  the  habit  of 
devoting  some  of  our  time  towards  collecting  (i.e.  sensing)  data  and  sharing  it, 
thereby  helping  others.  Because  of  that,  and  in  order  to  maintain  a loyal  user 
base,  it  is  necessary  to  be  innovative  when  trying  to  motivate  the  users.  Dur- 
ing the  analysis  of  the  crowd- sensing  applications,  the  following  motivation 
mechanisms  were  identified:  financial  motivation,  appealing  to  our  built-in 
altruistic  nature  (i.e.  it  is  natural  to  people  to  try  to  help  others),  gamification, 
in-app  games  and  social  networking.  While  financial  motivation  was  not  pre- 
sent in  the  analyzed  solutions,  most  relied  on  our  altruism  as  a driving  factor, 
with  Sports  Tracker  being  an  exception  as  it  is  built  for  sensing  tasks  about  a 
single  user.  The  altruistic  element  of  motivation  was  most  prominent  in  Usha- 
hidi.  Gamification  usually  means  that  the  application  is  assigning  some  reward 
(e.g.  points)  for  each  piece  of  data  sensed,  i.e.  making  the  users  feel  that  the  use 
of  the  application  is  a game  and  thereby  urging  them  to  keep  on  sensing.  In-app 
games  keep  the  users  locked  to  the  application  longer.  The  social  networking 
features  allow  the  users  to  somehow  reach  their  peers.  In  the  motivation  area 
Waze  has  a dynamic  points  system  with  levels,  which  are  obtainable  after  long 
hours  of  use  and  contributing  information,  as  well  as  in-app  games  and  mes- 
saging. It  also  plays  on  the  natural  altruism  of  its  users,  as  it  allows  them  to  help 
others  via  marking  a police  presence  or  speed  control.  Foursquare  also  had  a 
points-based  system,  badges  awarded  for  certain  check-ins,  social  networking 
layer  for  staying  in  contact  with  friends  and  it  allowed  users  to  become  mayors’ 
of  the  places  they  visited  more  often  than  others.  This  latter  might  be  regarded 
as  a hybrid  form  of  a gamification  feature  and  an  in-app  game.  PatientsLikeMe 
has  a modest  motivation  scheme  consisting  of  daily  polling  in  which  it  asks  its 
users  to  share  information  about  their  health.  Apart  from  that  it  also  awards  up 
to  three  stars  for  sharing  specific  information. 

The  following  privacy  aspects  of  the  crowd- sensing  solutions  were  taken 
into  consideration:  the  amount  of  personally  identifiable  information  shown 
on  screen  to  other  users,  the  existence  of  news  reports  or  other  proof  about 
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privacy  breaches  and  the  sophistication  of  techniques  used  to  ensure  the  pri- 
vacy of  the  crowd-sensors.  In  FixMyStreet  the  personal  information  collected 
is  limited,  it  is  not  shown  on  screen  and,  according  to  the  privacy  policy,  it 
is  shared  only  with  system  administrators  and  council  members  who  might 
be  contacted  to  address  the  issue  reports  sent  via  the  application.  Ushahidi 
anonymizes  the  contributions  of  its  users,  there  were  no  reports  about  privacy 
breaches  at  the  moment  of  writing,  and  it  also  has  an  elaborate  privacy  policy. 
Sports  Tracker  also  applies  the  privacy  components  and  there  were  no  known 
privacy  breaches.  The  Smart  Citizen  project  and  Waze  apply  the  above  tech- 
niques, but  show  both  the  username  and  the  location  of  the  sensors  on  screen, 
which  might  allow  malevolent  third  parties  to  misuse  that  information.  The 
rest  of  the  analyzed  applications  also  employ  techniques  which  raise  the  level 
of  user  privacy,  but  they  either  do  not  completely  hide  personally  identifiable 
information  (e.g.  full  name  and  address  visible  in  Foursquare),  or  there  were 
news  reports  about  some  form  of  privacy  breaches  (e.g.  PLM). 

Table  1 contains  a comparative  overview  of  the  analyzed  crowd-sensing 
based  solutions.  The  following  three  types  of  marks  were  assigned  based  on 
their  analysis  laid  out  above: 

• Leader/High  impact  - the  solution  applies  the  majority  of  the  available  tech- 
niques in  the  subject  domain  or  has  the  highest  expected  social/economic 
impact. 

• Mid-range  capabilities/impact  - the  solution  applies  some  of  the  techniques 
relevant  in  the  area  or  has  medium  societal/economic  impact. 

• Limited  capabilities/low  impact  - the  solution  has  limited  capabilities  or  its 
societal/economic  impact  is  marginal. 

Essentially,  the  analyzed  applications  are  quite  different  and  their  comparison 
was  done  by  analyzing  some  of  their  common,  key  aspects.  The  author  did  not 


App/Solution 

Societal 

impact 

Economic 

impact 

Architect. 

Sensing 

Motiv. 

Privacy 

Smart  Citizen 

• • 

• 

• • 

• • 

• 

• • 

FixMyStreet 

• • 

• • 

• 

• 

• 

• •• 

Ushahidi 

• • 

• 

• 

• 

• 

• •• 

Waze  + Anagog 

• • 

• •• 

• • 

• •• 

• •• 

• • 

PatientsLikeMe 

• • 

• • 

• 

• 

• • 

• 

Sports  Tracker 

• • 

• 

• 

• •• 

• 

• •• 

Foursquare 

• • 

• • 

• • 

• 

• •• 

• 

Table  1:  Comparative  analysis  of  crowd-sensingbased  applications  (•••  - leader/ 
high  impact,  - mid-range  capabilities/impact,  • - limited  capabilities/ 
impact). 
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attempt  to  directly  compare  the  solutions  based  on  the  number  of  points  they 
were  awarded  in  Table  1. 

The  author  plans  to  extend  the  initial  results  presented  in  this  section  as  part 
of  his  future  research,  by  including  a more  detailed  measurement  of  the  num- 
ber of  total  and  active  users  of  each  solution,  adding  a more  detailed  econo- 
metric analysis  of  the  diverse  solutions  discussed,  as  well  as  by  identifying  and 
including  them  additional  solutions  in  each  group  and  localizing  the  compari- 
sons inside  each  group,  thereby  avoiding  directly  comparing  health  and  traffic 
apps  to  each  other. 


Summary 

This  paper  contains  a comparative  analysis  mobile  crowd- sensing  solutions 
in  the  Smart  City  environment.  The  analysis  was  focused  on  crowd-sensing 
in  the  following  focus  areas:  environment,  citizen  collaboration,  urban  traffic 
systems,  health/fitness  and  social  networking.  From  each  of  these  focus  areas 
up  to  two  successful  applications  were  selected  and  analyzed.  The  applications 
were  analyzed  (e.g.  Waze,  Foursquare,  Ushahidi,  the  Smart  Citizen  project, 
PatientsLikeMe)  by  a compound  comparison  criteria  consisting  of  the  follow- 
ing elements:  societal  and  economic  impact,  sophistication  of  system  architec- 
ture, novel  methods  of  sensor  use,  motivation  scheme  and  steps  taken  towards 
ensuring  user  privacy. 

As  a continuation  of  this  work,  the  author  intends  to  identify  additional 
crowdsensing-based  applications  in  the  Smart  City  ecosystem  and  do  a more 
detailed  comparative  analysis,  especially  in  the  areas  which  might  be  objectively 
measured,  e.g.  number  of  total  vs  active  users,  economic  impact  via  economet- 
rics, etc.  Apart  from  that,  the  author  also  plans  to  analyze  crowd- sensing  appli- 
cation use  through  space  and  time  and  find  patterns  leading  to  success  or  failure. 
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Abstract 

In  this  paper,  we  present  the  application  of  the  mobile  crowd-sensing  para- 
digm in  supporting  efficient,  safe  and  green  mobility  in  urban  environments. 
We  have  developed  the  CitySensing  framework  demonstrating  the  viability  of  a 
common  crowd- sourcing  platform  applied  to  various  urban  mobility  domains. 
We  argue  that  todays  mobile  devices,  with  integrated  or  add-on  sensors,  can 
be  efficiently  used  to  crowd  source  diverse  information  in  domains  that  are 
relevant  to  urban  life  and  mobility  (traffic,  air  quality  and  citizens’  everyday 
activities).  This  is  illustrated  by  three  distinct  mobile  applications,  developed 
on  top  of  the  CitySensing  framework,  that  contribute  to  a common  goal  of 
smarter  urban  mobility.  Commonly  integrated  accelerometer  and  GPS  are  used 
to  infer  traffic  events  and  conditions.  Externally  attached  or  integrated  air  qual- 
ity sensors  enable  suggestions  for  city  areas  adequate  for  outdoor  activities  on 
a specific  day  of  the  week  or  hour  of  the  day.  Mobile  phone  usage  statistics  and 
analysis  can  present  valuable  information  to  urban  planning  services  to  better 
adapt  to  citizens’  habits  and  mobility.  The  analysis  of  this  massive  amount  of 
crowd  sensed  data  (so-called  Big  Data)  within  the  cluster/cloud  infrastructure 
enables  detection  of  situations  and  events  that  influence  human  mobility,  and 
dissemination  of  notifications  and  recommended  actions. 
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Introduction 

With  advancements  and  the  proliferation  of  mobile  devices  with  increasing 
computing,  communication  and  sensing  capabilities,  mobile  users  have  become 
important  sources  of  sensing  data  in  a globally  spread  wireless  sensor  network 
(Campbell  et  al.  2008).  Mobile  crowd  sensing  represents  an  approach  for  a new 
sensing  and  geo-crowdsourcing  paradigm  that  leverages  the  power  of  various 
mobile  devices,  such  as  smartphones,  tablets,  wearable  devices,  smart  sensors, 
etc.  It  is  based  on  the  human  ability  to  acquire  local  geospatial  information  and 
knowledge  through  sensor-enhanced  mobile  devices,  and  the  possibility  to  share 
this  information/knowledge  with  other  users  and  a wide  community  (Goodchild 
2007;  Kamel  Boulos  et  al.  2011).  As  the  world  population  becomes  increasingly 
urban,  cities  worldwide  need  to  leverage  information  and  communications  tech- 
nologies to  improve  their  functions,  enhance  efficiency,  improve  competitiveness 
and  the  economy  and  provide  better  environment  for  their  citizens;  to  become 
smarter’.  Smart  mobility  is  one  of  the  key  characteristics  of  a smart  city,  and  also 
one  of  the  biggest  challenges  that  smart  cities  face  (Batty  et  al.  2012).  The  world 
population  makes  more  than  64%  of  all  travel  kilometers  within  urban  environ- 
ments, which  is  expected  to  triple  by  2050.  Thus,  a high  priority  for  cities  around 
the  world  is  to  support  citizens’  mobility  within  the  urban  environment  based 
on  safety,  efficiency  and  environmental  protection  (Motta  et  al.  2015).  Mobile 
crowd  sensing  paradigm  will  enable  key  methods,  techniques  and  systems  in 
that  direction.  The  application  of  analysis,  reasoning  and  data  mining  techniques 
on  the  mobile  crowd  sensed  data  provides  useful  insights  in  citizens’  mobility 
and  supports  better  citizen  involvement  in  monitoring  urban  spaces. 

In  this  article,  we  describe  main  concepts,  methods  and  technologies  of 
mobile  crowd  sensing  in  smart  cities  and  present  the  CitySensing  framework 
developed  to  support  smart  urban  mobility  through  participation  and  intel- 
ligence of  the  crowd  (Section  2).  Several  mobile  demo  applications  have  been 
developed  based  on  this  framework  that  illustrate  the  use  of  mobile  crowd  sens- 
ing in  different  city’s  scenarios  and  provide  insights  how  such  applications  could 
improve  citizens’  mobility  (Section  3).  Although  there  are  still  open  challenges 
and  real-life  deployments  of  developed  applications,  the  initial  evaluations  look 
promising  in  supporting  smart  citizens’  mobility,  as  concluded  in  Section  4. 

Mobile  crowd  sensing  for  smart  cities 

The  mobile  crowd  sensing  refers  to  geo -crowdsourcing  and  VGI  (Volunteered 
Geographic  Information)  paradigm  in  which  mobile  users  use  their  mobile 
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computing,  communication  and  sensing  devices  to  collect,  locally  process  and 
analyze,  as  well  as  distribute  geo -referenced  information  (Chatzimilioudis  et  al. 
2012).  Mobile  crowd  sensing  paradigms  have  become  the  significant  source  of 
VGI  and  crowdsourced  geo -information  owing  to  the  large  number  of  mobile 
devices  carried  by  people  worldwide  to  support  their  daily  activities  (Zhang 
et  al.  2014).  Such  crowd  sensed  data  is  filtered,  aggregated,  processed  and  ana- 
lyzed at  the  server(s)  and  appropriate  information  services  and  notifications 
are  delivered  to  mobile  users  to  support  their  mobility.  In  such  scenario  mobile 
users  are  both  the  producers  of  mobility  sensing  data,  and  the  consumers  of 
services,  notifications  and  recommendations  based  on  processing  and  analysis 
of  the  massive  amount  of  crowd  sensed  data  at  server (s). 

There  are  several  sources  of  mobility  sensing  data  originated  at  mobile 
devices,  classified  as: 

• Physical  sensors, 

• Virtual  (logical)  sensors, 

• Social  sensors. 

Physical  sensors  include  sensors  integrated  in,  or  attached  to  mobile  devices 
(smart  phones,  tablets,  etc.),  such  as:  GPS,  microphone,  camera,  ambient  light 
sensor,  accelerometer,  gyroscope,  compass,  proximity  sensor  and  also  the  tem- 
perature and  humidity  sensors  available  on  advanced  smartphones.  The  devel- 
opment of  wearable  and  pervasive  systems,  such  as  Sensordrone1  and  iWatch2, 
provides  integration  of  additional  sophisticated  sensors,  worn  by  users  and 
attached  to  their  mobile  devices,  to  measure  air  pollution,  personal  health 
parameters  and  the  emotional  and  physiological  status  of  users.  Virtual  sen- 
sors are  not  hardware  sensors  but  software  applications  that  run  at  user  devices 
and  collect  information  about  users,  their  profile  and  preferences,  detecting 
their  context  and  situation.  Such  sensors  detect  information  related  to  user 
communications  (voice,  SMS,  etc.),  user  activities  and  interaction  with  devices 
(active  applications,  application  in  focus,  the  type  of  the  interaction,  etc.),  user 
preferences  and  profile,  user-generated  content  (texts,  speech,  videos,  photos, 
sounds),  etc.  Virtually  sensed  information  is  referenced  in  space  and  time  and 
attached  to  a certain  location,  symbolic  or  geographic.  Social  sensors  detect 
user  social  status  and  activities,  social  network  and  social  media  interactions 
(tags,  likes,  Tweets,  photos,  etc.),  currently  connected  friends  and  their  sta- 
tus, connections  in  vicinity,  etc.  Some  of  such  information  can  be  detected  by 
accessing  social  network/media  services  through  appropriate  APIs. 

Mobile  crowd  sensed  data  collection  can  be  performed  with  the  participa- 
tion and  active  user  involvement,  called  participatory  sensing , or  without  active 
involvement  of  users,  i.e.  opportunistic  sensing  (Campbell  et  al.  2008).  The 


1 http://sensorcon.com/products/sensordrone-multisensor-tool 

2 https://www.apple.com/watch/ 
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research  of  mobile  crowd  sensing  for  smart  cities  focuses  on  the  collection, 
representation,  filtering,  aggregation,  processing  and  analysis  of  large  volumes 
of  mobility  sensing  data,  both  on  mobile  devices  and  at  central  server(s).  Such 
data  describe  citizens  movements  and  activities,  environmental  conditions  they 
face  (e.g.  high  pollution  level,  traffic  congestion,  crowded  or  noisy  surround- 
ings), communication  and  interaction  they  make  (talking,  searching  on  the 
Web,  tagging  a photo,  connecting  with  friends,  etc.).  The  analysis  and  mining 
of  mobility  data  obtained  through  mobile  crowd  sensing  provide  insights  into 
important  features  of  the  urban  mobility,  behavior  of  people  moving  around  a 
city  and  possible  prediction  of  future  mobility.  There  are  two  approaches  for 
collection  and  distribution  of  mobile  crowd  sensed  data: 

• Server-based  approach  - Sensed  data  in  its  raw  form  are  sent  to  a server 
without  processing  and  analysis  at  the  mobile  devices.  Such  an  approach 
results  in  very  high  demands  for  wireless  network  bandwidth,  as  well  as  for 
storage,  processing  and  analysis  of  raw  mobility  data  at  the  server. 

• Distributed  approach  - Data  from  physical,  virtual,  and  social  sensors  are 
collected,  processed  and  analyzed  at  the  mobile  devices  and  high  level  con- 
textual and  mobility  information  are  sent  to  a server. 

Mobile  crowd  sensed  data  and  information  represent  the  foundation  for  mobil- 
ity information  services  that  are  delivered  to  the  same  users  to  support  their 
smarter  mobility  (Ilarri  et  al.  2015).  The  crowd  sensed  data  collected  from  a 
large  number  of  mobile  users/moving  objects  are  characterized  by  massive  vol- 
ume, velocity  and  variety,  and  can  be  regarded  as  Big  Data.  The  server  applica- 
tions running  on  a cluster/ cloud  that  support  mobile  crowd  sensing  should  pro- 
vide data  collection,  processing,  reasoning,  analysis  and  mining  over  massive 
mobility  data  sets,  from  structured  and  unstructured  sources.  Such  a collabora- 
tive crowd  intelligence  provides  detection  of  aggregated  mobility  patterns  and 
trajectories,  group  activities  and  behavior,  as  well  as  complex  situations  (e.g. 
interesting  places,  traffic  congestions,  popular  city  routes  or  crowded  evacu- 
ation paths  in  an  emergency  situation)  in  smart  cities  (Cardone  et  al.  2013). 

We  have  developed  a framework,  named  CitySensing , to  support  the  devel- 
opment of  mobile  crowd  sensing  applications  for  various  smart  city  scenarios. 
The  CitySensing  framework  consists  of  mobile  application  components,  server 
components  and  visualization/ analytics  components  organized  in  a distributed 
architecture  given  in  Figure  1.  It  supports  both  opportunistic  and  participa- 
tory methods  of  crowd  sensed  data  collection  using  various  physical,  virtual 
and  social  sensors  available  in  todays  mobile  devices.  It  fully  leverages  process- 
ing, sensing  and  communication  capabilities  of  mobile  devices  and  provides 
distributed  and  scalable  storage,  processing,  analysis  and  mining  of  crowd 
sensed  mobility  data  at  mobile  devices  within  CitySensing  mobile  components. 
High-level  mobility  information  generated  at  users’  devices  is  further  aggre- 
gated, processed  and  analyzed  at  the  CitySensing  server  components  running 
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Figure  1:  A general  architecture  of  CitySensing  framework. 


on  a cluster/cloud  infrastructure.  The  CitySensing  framework  is  based  on  the 
open-source  sensor  data  collection  framework,  Funf,3  and  the  data  analysis  and 
mining  framework  WEKA4.  For  processing  and  analysis  of  big  spatio-tempo- 
ral data,  CitySensing  server  components  are  based  on  distributed  processing 
frameworks,  such  as  MapReduce/Hadoop5  and  Spark,6  for  processing  offline 
crowd-sensed  data  stored  in  a distributed  file  system  over  a cloud/cluster.  For 
real-time  processing  and  analysis  of  large  crowd  sensed  data  streams  appropri- 
ate data  stream  processing  frameworks  are  employed,  such  as  Apache  Storm7 
and  Spark  Streaming. 


Dynamic  crowd-sensed  information  for  smart  mobility 

Smart  mobility  in  urban  environments  is  based  on  integrated  mobile  infor- 
mation services  to  support  efficient,  safe  and  green  transport  of  people  and 
goods  and  user  activities  in  smart  cities.  We  are  developing  several  demo  smart 


3 http://www.funf.org/ 

4 http : / / www. cs .waikato.  ac. nz/ ml/ weka/ 

5 https://hadoop.apache.org/ 

6 https://spark.apache.org/ 

7 https://storm.apache.org/ 
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mobility  applications,  namely  DriveSense , ExposureSense  and  UrbanSense , 
based  on  CitySensing  framework,  to  support  different  urban  mobility  scenarios. 


Crowd-sensing  traffic  information 

Recent  research  on  smart  mobility  has  started  to  leverage  the  principle  of  mobile 
crowd  sensing  by  employing  drivers/passengers  and  pedestrians  equipped  with 
mobile  devices  as  real-time  sources  of  navigation,  environmental  and  traffic 
information.  The  basic  idea  implemented  in  DriveSense  application  is  to  use  driv- 
ers/passengers and  pedestrians  as  moving  sensor  platforms  that  can  collect  traf- 
fic, driver  and  road  related  conditions,  events  and  information  and  send  them  to  a 
central  server,  or  to  mobile  devices  in  the  vicinity,  using  appropriate  wireless  com- 
munication mechanism  (3G,  Wi-Fi,  Bluetooth,  etc.).  Such  dynamic  information 
should  support  citizens  in  their  everyday  mobility,  both  indoors  and  outdoors. 

Regarding  vehicle  transport,  mobile  crowd  sensing  systems  could  detect  and 
report  traffic  events  and  conditions,  the  state  of  road  infrastructure,  driver 
behavior  and  activities,  as  well  as  accidental  events: 

• Traffic  events  and  condition 

- Travel  times  over  the  street  segments. 

- Slowed  down  traffic  (traffic  jams). 

- Congestion  points  (start-stop  locations) 

- Parking  places. 

- Traffic  stops. 

• Road  infrastructure  state 

- Bumpy  road. 

- Slippery  road  surface. 

- Damaged  road  surface  location  (potholes). 

• Dynamic  traffic  events  and  driver  behavior  (accidents  and  situations  that 

could  potentially  cause  accidents) 

- Sudden  breaking,  decelerations  and  acceleration. 

- Lateral  skidding. 

- Violent  (sudden)  change  of  direction  and  traffic  lane  at  a high  speed. 

- A vehicle  moved  out  of  a road;  an  accident  or  a collision  happened,  etc. 

The  DriveSense  application  is  used  as  a tool  for  crowd  sensing  of  dynamic  traf- 
fic information,  and  also  as  dynamic  navigation  service  based  on  current  traffic 
conditions.  Both  participatory  and  opportunistic  modes  can  be  used  for  collec- 
tion of  information  on  traffic  condition  and  events  (Figure  2a).  During  driving, 
relevant  traffic  events  (e.g.  sudden  lane  changes,  numerous  hard  breaking  events, 
etc.)  are  reported  by  moving  users  and  delivered  to  all  drivers  in  affected  street 
segments.  As  a result,  a service  also  proposes  re-routing  to  drivers  to  avoid  an 
intersection  that  is  probably  heavily  congested,  or  potentially  unsafe  for  driving. 
Spatio-temporal  aggregation  and  analysis  of  crowd  sensed  traffic  information 
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Figure  2:  DriveSense  application  a)  Mobility  information  service  b)  Space-time 
clustering  of  reported  traffic  events. 


collected  over  longer  time  periods  can  provide  valuable  information  to  city  traf- 
fic service  and  identify  parts  of  city’s  street  network  that  could  affect  safe  and 
efficient  traffic.  The  results  of  this  analysis  are  shown  in  Figure  2b  showing  parts 
of  a street  network  with  detected  potholes  (blue  dots),  areas  with  excessive  harsh 
breaking  (red  dots)  and  frequent  violent  lane  changes  (green  dots). 


Air  quality  sensing  and  reporting 

Air  pollution  is  identified  as  a major  health  concern  in  urban  areas.  The  citizens 
are  also  exposed  to  other  types  of  urban  pollutions’  such  as  noise  and  electromag- 
netic waves.  By  applying  crowd  sensing  concepts  to  users  with  mobile  devices, 
both  citizens  and  city  authorities  can  detect  and  be  aware  of  pollution  exposure 
during  their  everyday  activities  in  urban  areas,  especially  for  pedestrians,  as  dem- 
onstrated by  ExposureSense  application  (Figure  3).  As  with  previous  DriveSense 
example,  users  of  ExposureSense  act  as  both  sources  of  information  on  air  quality 
and  consumers  of  such  information  which  support  planning  their  daily  physical 
activities  in  city  areas  with  less  exposure  to  pollution.  By  cross -correlating  their 
physical  activities  with  pollution  information,  users  can  receive  information  and 
services  regarding  their  pollution  exposure  and  recommendation  for  activities,  as 
well  as  navigation  instructions  to  cleaner’  areas  (Predic  et  al.  2013). 

The  ExposureSense  application  can  act  as  a personal  diary  of  physical  activi- 
ties giving  both  map  view  (Figure  3a)  and  timeline  view  (Figure  3b).  This  infor- 
mation can  be  combined  with  sensed  air-quality  parameters  (like  temperature, 
concentration  of  different  particles  or  gasses  in  the  air).  Figure  3c  shows  map 
visualization  of  N02  concentration  sensed  using  sensors  attached  to  a mobile 
device  during  typical  recreation  activity.  Summarized  information  is  shown  in 
a calendar  type  view  in  Figure  3d. 

The  drivers/passengers  and  pedestrians  are  also  the  consumers  of  smart 
mobility  information  services  generated  by  such  crowd  intelligence  approach. 
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Figure  3:  ExposureSense  mobile  crowd  sensing  application  a)  Map  view  of 
activities  b)  Timeline  view  of  activities  c)  Exposure  to  N02  during  recreation 
activity,  d)  Summary  information  of  activities  and  exposures  to  pollutants. 


They  receive  real-time  traffic  information,  navigation  instructions  and  notifica- 
tions, recommended  activities  and  environmental  conditions  and  actuate  upon 
them  (Predic  & Stojanovic  2015).  City  authorities  can  use  such  crowd  intelli- 
gence to  improve  urban  mobility  by  detecting  mobility  behaviors  and  patterns 
of  citizens/tourists  movement.  Such  behavior  and  movement  patterns,  related 
to  background  geographic  information  (POI,  road  network  data,  indoor  maps), 
time  context  (time  of  the  day,  day  of  the  week,  month,  season),  weather  condi- 
tions, means  of  transport,  environmental  conditions,  social/cultural  events  in 
the  city  or  indoors,  can  enable  better  understanding  of  the  city  dynamics. 

Behavior  patterns  sensing  using  mobile  activity  collection 

Ubiquitous  and  transparent  collection  of  mobile  usage  statistics  as  geo -refer- 
enced mobility  data  can  provide  valuable  information  to  urban  planning  city 
services.  Crowd  sensed  information  showing  urban  areas  where  mobile  users 
tend  to  spend  their  leisure/break  time,  types  of  physical  activities  during  parts 
of  a day,  and  their  physical  environment  during  these  activities  can  allow  city 
planners  to  adapt  city’s  business  and  recreational  areas  according  to  citizens’ 
mobility  in  urban  environment.  The  UrbanSense  mobile  application  has  been 
developed  as  a mobile  diary  service  that  detects  and  stores  information  about 
users’  environment  and  mobile  interaction  by  collecting  following  data: 

• Physical  environment:  temperature,  air  pressure,  air  humidity,  light  inten- 
sity, etc. 

• Phone  communication  usage:  SMS  sending/receiving  and  phone  calls. 

• User’s  physical  activities. 

• Wireless  communication  devices  in  surrounding,  e.g.  WiFi  access  points, 
cellular  network  base  stations,  Bluetooth  devices,  other  mobile  devices,  etc. 


Mobile  crowd  sensing  for  smart  urban  mobility  379 


• Active  usage  of  a phone  based  on  a screen  on/off  state. 

• Usage  of  certain  types  of  mobile  applications  especially  social  networking  ones. 

• Recording  of  photos/videos  using  phone  cameras. 

• Playback  of  sound/video  files. 

• Battery  level  / charging  data. 

The  UrbanSense  demo  application  consists  of  a mobile  service  tasked  with  col- 
lection of  sensing  data,  a mobile  application  used  for  service  configuration  and 
access  to  mobility  information  services,  as  well  as  a server(s)  which  receives, 
stores,  aggregates,  processes  and  analyses  crowd  sensed  data.  The  mobile  service 
collects  data  in  a local  mobile  database  in  predefined  time  intervals  (usually 
daily)  and  sends  the  collected  data  to  the  server.  A service  configuration  allows 
the  user  to  filter  which  types  of  activity  information  can  be  collected  during  spe- 
cific periods  of  a day/ week,  for  the  reason  of  privacy  preservation  (Figure  4a). 
All  sensitive  personal  information  in  collected  data  is  hashed  in  order  to  pre- 
serve data  topology  and  semantics  while  maintaining  anonymity  and  privacy. 
This  includes  phone  numbers  and  SMS  messages  content.  Prior  to  submitting 
collected  data  to  the  server  user  can  view  and  filter  it  in  the  mobile  application, 
as  shown  in  Figure  4b.  At  the  server,  collected  data  are  processed  and  analyzed 
by  applying  spatio-temporal  clustering  techniques  and  social  connections  based 
on  SMS  and  voice  calls  history  are  detected.  After  clustering,  urban  areas  with 
intensive  physical  and  communication  activities  can  be  visualized  on  a map  and 
filtered  according  to  the  type  and  the  period  of  day/week  (Figure  4c). 

The  main  goal  of  UrbanSense  application  is  to  infer  social  network  connec- 
tions and  detect  urban  areas  with  intensive  social  interaction,  or  which  contain 
a specific  mobile  usage  pattern.  The  identification  of  such  areas,  along  with 
time  periods  when  certain  patterns  of  social  activities  occurred,  can  be  used  for 
urban  planning,  coverage  of  city  areas  with  marketing,  sports  or  leisure  events 
and  supporting  infrastructure. 


Figure  4:  UrbanSense  application  a)  Data  collection  configuration  b)  Overview 
and  filtering  of  collected  data  c)  Spatio-temporal  clustering  of  collected  com- 
munication data. 
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Server-side  processing  techniques  for  crowd-sourced  spatial  data 

Using  their  mobile  devices  while  walking  and  driving,  and  participating  in  social 
networks  and  media,  citizens  provide  a wealth  of  information  related  to  their 
mobility  in  urban  environments  and  can  support  numerous  applications  for  the 
benefit  of  the  urban  community.  Smart  mobility  crowd  sensing  applications  pre- 
sented so  far  are  based  on  local  processing  and  analysis  of  mobile  sensor  data  to 
detect  high  level,  semantic  mobility  information  at  mobile  devices.  Such  infor- 
mation is  transferred  to  central  server(s)  that  perform  aggregation  of  semantic 
mobility  data  collected  from  a large  number  of  mobile  users  and  provide  query 
processing,  reasoning,  analysis  and  mining,  over  big  mobility  data  sets.  The 
server  performs  detection  of  aggregated  mobility  patterns  and  trajectories,  col- 
lective activities  and  behavior,  as  well  as  complex  mobility  situations  (e.g.  traffic 
congestions,  risky  and  dangerous  events,  frequent  city  routes,  crowded  evacua- 
tion paths  in  an  emergency  situation,  popular  places  and  orders  of  their  visit,  etc.) 
(Ilarri  et  al.  2015).  Mobile  crowd  sensed  data  processing,  analysis  and  visualiza- 
tion, either  for  single  citizen  or  aggregated  set  of  citizens  through  time,  enables 
early  detection  of  situations  and  events  that  affect  urban  mobility,  as  well  as  the 
estimation  of  their  impact  to  the  community.  Also,  they  provide  dissemination 
of  proactive  location-based  and  context-aware  services,  notifications  and  alert 
messages,  as  well  as  recommendations  for  the  actions  to  citizens  on  the  move. 

Crowd  sensed  data  collected  from  UrbanSense  mobile  application  can  be 
analyzed  and  visualized  to  detect  user  activities  on  the  move.  The  timeline  and 
a map  overlay  visualization  of  per-user  activity  data  collected  by  UrbanSense 
application  is  shown  in  Figure  5a. 

Crowd  sensed  data  from  numerous  mobile  citizens  are  streamed  to  server 
applications  based  on  big  data  processing  and  analysis  frameworks  deployed 
over  distributed  computing  infrastructure,  such  as  MapReduce/Hadoop, 
Apache  Spark  and  Apache  Storm  and  Web -based  visual  analytics  components 
and  technologies.  Such  architecture  allows  building  of  flexible  and  scalable 
crowd- sourcing  software  services  that  can  effectively  analyze  citizens’  mobil- 
ity patterns  in  modern  cities.  In  Figure  5b,  a heavy  traffic  on  major  city  streets 
(different  levels  of  red  color)  detected  by  crowd- sourcing  traffic  information  is 
co-occurred  with  bursts  of  ‘traffic’  on  Twitter  around  the  same  locations  (blue 
colored  circles).  The  contents  of  Tweets  could  offer  a clear  and  fast  explanation 
of  the  traffic  jam,  e.g.  a protest  of  local  farmers. 


Conclusions 

Presented  demo  mobile  applications  confirm  that  mobile  crowd  sensing  rep- 
resents a very  promising  approach  for  mobilization  of  citizens  in  order  to 
improve  and  adapt  their  urban  mobility.  Demonstrated  and  other  freely  avail- 
able mobile  crowd  sensing  applications  suggest  that  crowd- sourcing  dynamic 
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Figure  5:  Visualization  and  analysis  of  crowd  sensed  mobility  data  a)  Citizens 
activities  visualized  on  a timeline  and  a city  map  b)  Traffic  congestion  and 
Twitter  activities  in  Nis  based  on  Storm  server  application. 


geographic  information  has  potential  to  improve  various  aspects  of  urban  life 
and  urban  mobility  In  this  paper  we  focused  on  probably  the  most  prominent 
domains  such  as  traffic,  air  quality  and  citizens  activities  on  the  move  that  have 
the  great  impact  on  citizens  mobility  and  city  dynamics.  Presented  research 
work  demonstrates  that  other  aspects  of  urban  life  and  planning  can  also  ben- 
efit from  crowd- sourcing  geographic  information. 

There  are  still  a number  of  open  issues  and  research  challenges  in  mobile 
crowd  sensing  approach  applied  to  smart  cities.  To  ensure  the  citizens  par- 
ticipation, reliability,  security  and  privacy  must  be  covered  in  the  future  work. 
Approproate  methods  and  tools  should  be  investigated  and  developed  to  ensure 
the  accuracy  of  the  information  collected,  an  adequate  level  of  privacy  to  citi- 
zens, as  well  as  citizens  motivation  through  gamification  and  micro  payments. 
Also,  the  future  research  must  take  into  account  the  high  volume,  velocity, 
variety  and  veracity  of  data  provided  by  the  citizens,  and  collected  from  other 
structured  and  unstructured  data  sources  (e.g.  sensor  networks,  Internet  of 
Things,  etc.).  Accordingly,  advanced  methods  and  techniques  for  processing, 
analysis  and  visualization  of  such  Big  mobility  data  should  be  developed,  that 
will  unearth  new  information  and  knowledge  by  exploring  the  correlation  and 
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patterns  across  multi-sources  of  crowd  sensed  data.  It  will  provide  more  effec- 
tive services  and  solutions  for  citizens  and  decision  makers  in  urban  mobility. 
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Abstract 

When  travelling  in  space,  humans  perceive  the  environment  and  evaluate  it 
affectively.  This  chapter  illustrates  how  mobile  crowdsourcing  and  social  media 
data  can  be  used  to  study  peoples  affective  responses  to  different  environments. 
It  also  showcases  how  these  affective  responses  can  be  used  to  provide  a better 
understanding  of  human-environment  interaction,  as  well  as  to  enable  smart 
geospatial  applications  (particularly  navigation  systems).  This  chapter  also  dis- 
cusses some  essential  challenges  that  need  further  investigations  when  crowd- 
sourcing peoples  affective  responses.  Some  of  these  challenges  are  participation 
motivating,  data  quality  and  privacy. 
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Introduction 

When  travelling  in  space,  people  perceive  the  environment  and  interpret  it 
affectively.  A lot  of  our  daily  behaviour  and  decision  making  is  influenced  by 
this  kind  of  interpretation  and  affective  evaluation  of  space  (Kaplan  & Kaplan 
1989;  Borst  et  al.  2009).  For  example,  during  wayfinding,  which  route  to  choose 
and  which  one  to  avoid  may  not  only  depend  on  distance,  but  also  be  influ- 
enced by  peoples  affective  evaluation  of  the  environment,  such  as  safety  and 
attractiveness.  Studying  peoples  affective  responses  to  environment  contributes 
to  a better  understanding  of  peoples  spatial  experiences  and  behaviour,  as  well 
as  enables  many  applications,  such  as  navigation  systems  and  urban  planning. 

Conventionally,  affect  has  been  a central  research  topic  within  modern  psy- 
chology. There  were  also  many  geographers,  especially  human  geographers, 
approaching  affective  responses  to  environments  when  studying  place.  These 
conventional  studies  often  employed  empirical  experiments  in  labs  or  in  fields, 
which  are  often  very  expensive  and  time  consuming,  and  hard  to  apply  for  large- 
scale  studies.  In  recent  years,  with  the  increasing  availability  of  smartphones, 
researchers  start  to  apply  the  principle  of  citizens  as  sensors  (Goodchild  2007), 
and  explore  the  potential  of  crowdsourcing  affective  experiences  from  a large 
group  of  people.  Furthermore,  with  the  impetus  of  social  networking  services 
(e.g.  Facebook,  Foursquare,  Flickr,  and  Twitter),  large  volumes  of  social  media 
data  have  been  continually  created.  These  data,  especially  geotagged  ones, 
contain  lots  of  information  about  peoples  experiences  and  activities  in  vari- 
ous environments,  which  is  a new  and  significant  source  for  studying  peoples 
spatial  experiences  at  different  environments. 

This  chapter  presents  our  on-going  research  efforts  towards  these  trends. 
We  illustrate  how  mobile  crowdsourcing  and  social  media  data  analysis  can 
be  used  to  collect  peoples  affective  responses  to  space,  as  well  as  the  potential 
applications  of  these  affective  data. 


Affective  responses  to  environments 

According  to  American  Psychological  Association  (2006),  affect  is  the  expe- 
rience of  feeling  or  emotion.  It  is  a key  part  of  the  process  of  an  organisms 
interaction  with  stimuli  (e.g.  the  environment).  Russell  (2003)  argues  that  some 
affective  response  is  always  present  within  a person,  as  neutral,  moderate  or 
extreme. 

There  are  many  studies  in  literature  focusing  on  structuring  or  modelling 
peoples  affective  responses  (Ekman  and  Friesen  1971,  Russell  2003,  Barrett 
2006).  The  first  group  of  studies  tries  to  define  basic  distinct  affective  responses 
such  as  happiness,  anger,  fear,  disgust  or  sadness  (e.g.,  Ekman  & Friesen  1971). 
This  approach  was  widely  applied  in  the  field  of  Affect  Computing’,  which 
uses  these  basic  affective  responses  to  label  facial  expressions,  and  adapts 


Using  mobile  crowdsourcing  and  geotagged  social  media  data  387 


services  according  to  these  labelled  affective  responses.  Others  adopt  a dimen- 
sional approach  to  study  peoples  affective  responses,  and  argue  that  affective 
responses  can  be  represented  using  some  principal  dimensions.  For  example, 
Russell  and  Barrett  (1999)  and  Russell  (2003)  vary  affective  responses  on  the 
dimensions  of  valence  (i.e.  ‘the  subjective  positive-to-negative  evaluation,  e.g. 
pleasant-unpleasant,  comfortable-uncomfortable),  and  arousal  (i.e.  ‘a  sense  of 
mobilization  or  energy’,  e.g.  activating-deactivating).  There  are  other  studies 
proposing  to  include  one  more  dimension  - motivational  intensity  (e.g.  the 
strength  of  an  urge  to  move  toward  or  away  from  a stimulus)  (Harmon-Jones 
et  al.  2013).  Barrett  (2006)  stated  that  valence  is  a basic  component  of  affec- 
tive responses  (all  people  seem  to  be  able  to  differentiate  between  pleasant 
and  unpleasant  affective  states),  while  the  degree  of  arousal  may  not  be  always 
experienced. 

In  this  paper,  we  are  particularly  interested  in  affective  responses  evoked  by 
environments.  According  to  Russell  (2003)  and  Barrett  et  al.  (2007),  environ- 
ments are  perceived  not  only  according  to  their  physical  features,  but  also  affec- 
tively. Some  places  are  experienced  as  pleasing,  while  some  others  as  disgust- 
ing and  unsafe.  These  affective  responses  to  environments  are  experienced  as 
attributes  or  qualities  of  environments,  which  are  commonly  described  with 
affect-related  adjectives  such  as  boring,  safe,  and  beautiful. 


Acquisition  of  affective  responses 

Affective  responses  to  environments  can  be  collected  through  various 
approaches,  among  which  self  reports  and  physiological  recordings  are  the 
most  prominent  conventional  methods  for  collecting  such  data.  However, 
these  approaches  were  traditionally  used  in  laboratories  with  highly  controlled 
conditions.  Recently,  with  the  rapid  development  of  new  technologies,  novel 
methods  and  sensors  have  become  available  for  collecting  affective  responses 
to  environments  (Huang  et  al.  2014;  Resch  et  al.  2015).  In  the  following,  we 
highlight  two  of  these  current  approaches. 


Mobile  crowdsourcing 

Self-reports  are  the  most  direct  way  to  gather  information  about  people’s  affec- 
tive responses  (Barrett  et  al.  2007).  Recently,  the  ubiquitous  availability  and  use 
of  smartphones,  and  the  rapid  spread  of  Web  2.0  potentially  enable  researchers 
to  collect  self-reported  affective  responses  from  a large  group  of  people.  In  the 
EmoMap  project  (Klettner  et  al.  2013),  we  applied  this  mobile  crowdsourcing 
approach  to  acquire  people’s  affective  responses  via  GPS-enabled  smartphones. 

To  help  people  report  their  affective  responses  via  smartphones,  we  devel- 
oped an  Affect- Space -Mo del.  Based  on  the  dimensional  approach  described 
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above,  the  model  adopts  a two-level  structure.  Firstly,  users  are  asked  to  rate 
their  ‘level  of  comfort’  (i.e.  the  valence  dimension)  in  their  current  environ- 
ment on  a 7-point  Likert  scale,  from  uncomfortable  (‘1’)  to  comfortable  (‘7’). 
Secondly,  users  can  optionally  provide  further  details  about  their  affective 
responses,  particularly  on  the  aspects  of  safety,  attractiveness,  diversity,  and 
relaxation.  These  four  aspects  were  obtained  from  an  empirical  study  (Huang 
et  al.  2014).  In  summary,  each  affective  response  to  environments  is  collected 
as  ratings  on  these  five  aspects  (i.e.  level  of  comfort,  safety,  attractiveness,  diver- 
sity, and  relaxation).  Each  affective  response  is  then  annotated  with  the  GPS 
location  obtained  from  user’s  smartphone. 

The  Affect- Space -Mo del  was  implemented  in  an  Android  mobile  application 
(Figure  1)  to  enable  people  to  report  their  affective  responses  to  environments 
anytime  and  anywhere.  To  help  users  locate  themselves,  this  application  shows 
the  current  location  (obtained  from  GPS  on  smartphones)  as  a marker  in  an 
OpenStreetMap.  Following  the  suggestions  by  Russell  (2003),  it  particularly 
asks  the  question  ‘Here  it  is  . . .’  to  explicitly  make  users  focus  on  the  environ- 
ment. To  better  understand  users’  affective  responses,  some  contextual  infor- 
mation was  also  collected,  i.e.  company,  familiarity  with  the  place,  time  and 
weather  (obtained  from  the  Web).  To  encourage  people’s  active  contributions, 
this  application  was  promoted  to  students,  as  well  as  to  an  urban  walking  com- 
munity.1 Users  were  asked  to  contribute  their  ratings  anywhere  and  anytime 
they  want  in  their  daily  life. 


Figure  1:  Main  screenshots  of  the  mobile  application  for  collecting  people’s 
affective  responses  to  environments  (Map  data:  OpenStreetMap  and  Con- 
tributors, CC-BY-SA). 


http://www.wildurb.at/ 
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Experiences  within  the  EmoMap  project  show  that  with  this  mobile  crowd- 
sourcing approach,  we  are  able  to  acquire  affective  data,  which  are  evoked  by 
realistic  scenarios,  directly  from  a large  number  of  people.  Until  December 
2013,  more  than  3,500  contributions  were  collected  from  more  than  200  peo- 
ple. The  contributions  were  distributed  at  different  places  of  the  world,  with 
98%  of  them  for  the  city  of  Vienna  (Austria).  As  an  example,  we  visualize  part 
of  these  contributions  for  the  city  of  Vienna  in  Figure  2. 


Extracting  affective  responses  from  social  media  data 

Recently,  the  increasing  availability  of  social  media  (e.g.  Foursquare  and  Flickr) 
has  led  to  the  accumulation  of  large  volumes  of  open  social  media  data.  Recent 
research  has  started  to  use  social  media  data  for  studying  peoples  affective 
responses  (Mislove  et  al.  2010;  Hauthal  & Burghardt  2013).  However,  these 
approaches  extract  peoples  affective  responses  in  various  environments,  which 
might  not  necessary  be  caused  by  or  towards  these  environments.  In  the  follow- 
ing, we  extend  existing  research,  and  illustrate  how  social  media  data  can  be 
harnessed  to  extract  peoples  affective  responses  to  environments.  Particularly, 
we  focus  on  geotagged  photos  in  Flickr. 

For  extracting  affective  responses  from  social  media  data,  we  apply  sentiment 
analysis  technique.  Sentiment  analysis  (or  opinion  mining)  is  a natural  language 
processing  (NLP)  technique,  and  aims  to  determine  an  author  s attitudes,  opin- 
ions or  sentiments  with  respect  to  the  topic  written  about.  Different  methods 
have  been  proposed  for  sentiment  analysis,  among  which  lexicon-based  method 
is  one  of  the  most  popular  ones.  Lexicon-based  sentiment  analysis  employs  NLP 
techniques  to  tokenize,  split,  and  lemmatize  text  into  a list  words,  and  use  word 
lexicons  to  determine  the  affective  valence  of  these  words  as  well  as  the  valence 
of  the  text.  Several  affective  word  lexicons  exist  for  this  purpose,  such  as  Affec- 
tive Norms  for  English  Words  (ANEW)  (Bradley  & Lang  1999)  and  Finn  Arup 
Nielsens  word  list  (AFINN)  (Nielsen  2011).  These  word  lexicons  contain  lists  of 
English  words  with  a valence  score  attached  to  each.  For  example,  in  AFINN,  on 
a scale  of  -5  (negative,  unpleasant)  and  +5  (positive,  pleasant),  “nice”  is  rated  as 
+3,  and  terrible  as  -3.  In  this  research,  we  use  AFINN  for  the  sentiment  analysis, 
this  is  mainly  because  AFINN  is  particularly  designed  for  microblogs  and  social 
media,  and  has  been  used  by  many  other  researchers  (See  http://neuro.imm. 
dtu.dk/wiki/AFINN  for  a list  of  studies  using  AFINN). 

Due  to  the  different  natures  of  different  languages,  we  only  focus  on  Flickr 
photos  with  title/description  in  English  language.  To  achieve  this  aim,  we  used 
an  open  language  detection  library  provided  by  Cybozu  Labs,  Inc.,  which  has 
a precision  of  99%  for  53  languages.2  According  to  the  observation  of  Kisi- 
levich  et  al.  (2010),  the  language  pattern  adjective-noun  is  the  simplest  and 


2 https://github.com/shuyo/language-detection. 
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most  popular  example  in  English  language  to  describe  the  characteristics  of  an 
object.  For  example,  a beautiful  place’  and  a dirty  street’  Therefore,  to  extract 
affective  responses  to  environments  from  geotagged  photos  in  Flickr,  we  pro- 
pose a lexicon-based  sentiment  analysis  method,  as  depicted  in  Figure  3. 

1)  Firstly,  for  each  geotagged  photo,  we  use  Stanford  CoreNLP  1.3.4  library 
(http://stanfordnlp.github.io/CoreNLP/)  to  tokenize,  split  and  lemmatise 
its  title  and  description,  and  clean  the  sentences  by  removing  English  stop 
words  (e.g.  ‘a’,  ‘by’  and  ‘since’).  A part-of- speech  (POS)  tagging  process 
is  also  applied.  Results  of  these  steps  are  a list  of  words  and  their  lexical 
category  (e.g.  noun,  verb  and  adjective). 

2)  Secondly,  we  extract  adjective-noun  sets,  e.g.  ‘interesting  building’. 
Again,  Stanford  CoreNLP  library  is  applied.  Results  of  this  step  is  a list  of 
adjective-noun  sets. 

3)  Thirdly,  we  filter  out  adjective-noun  sets  that  are  not  place-related.  To 
achieve  this  aim,  a list  of  place  nouns  is  created,  which  consists  of  English 
place  nouns  (e.g.  ‘building’,  ‘street’,  ‘restaurant’,  ‘park’,  ‘museum’,  ‘opera’...) 
and  study- area  specific  place-names  from  GeoNames  (http://www. 
geonames.org/)  (e.g.  ‘Stephansdom’,  ‘Karlskirche’).  If  the  noun  part  of  an 
adjective-noun  set  is  not  in  the  list,  we  consider  the  adjective-noun  set  as 
non-place-related. 

4)  Finally,  for  each  adjective  within  the  remaining  adjective-noun  sets,  we 
check  whether  it  is  in  the  AFFIN  affective  lexicons.  If  yes,  we  assign  the 
valence  value  to  the  photo.  Otherwise,  we  use  Java  WordNet  Library  to 
get  synonyms  of  the  adjective,  and  check  whether  one  of  the  synonyms  is 
in  the  AFFIN  affective  lexicons.  If  yes,  the  valence  value  is  also  assigned 
to  the  photo.  For  each  photo,  we  then  average  all  the  valence  values  of 
its  title  and  description,  and  assign  the  result  as  the  valence  value  of  this 
photo. 

This  lexicon-based  sentiment  analysis  method  was  applied  to  the  Flickr  photos 
uploaded  for  the  city  of  Vienna  (Austria)  in  the  period  of  January  2007  and 
October  2011,  which  contain  107,353  data  rows.  Figure  4 shows  the  results.  As 
can  be  seen  from  the  results,  different  places  are  connected  with  different  affec- 
tive responses.  However,  to  interpret  these  results,  one  should  consider  the  bias 
brought  by  Flick,  especially  the  contributing  inequality. 

Several  limitations  of  the  proposed  sentiment  analysis  method  should  be  also 
pointed  out.  Firstly,  it  ignores  slang  and  incomplete  words,  which  often  exist 


Figure  3:  Workflow  of  extracting  affective  responses  to  environments. 
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yellow  being  a bit  negative  and  red  being  very  negative. 
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in  Flickr  photo  title  or  description.  Secondly,  this  method  uses  a very  simple 
language  pattern  (i.e.  adjective-noun  set)  to  identify  affective  responses  caused 
by  the  environment.  It  should  be  improved  to  consider  more  language  patterns. 
Furthermore,  the  method  should  be  improved  to  check  whether  the  geoloca- 
tion of  the  photo  actually  matches  the  place  the  user  is  referring  to  in  the  title/ 
description  of  the  phone.  This  will  help  to  address  the  problem  of  location  accu- 
racy, as  well  as  discover  photos  that  are  taken  for  remote  environment  objects 
(pictures  of  the  Eiffel  tower  might  not  be  taken  directly  at  the  Eiffel  tower’). 


Selected  applications  of  affective  responses 

In  general,  collecting  affective  responses  from  a large  number  of  people  enables 
many  innovative  applications.  They  can  not  only  be  used  to  understand  peoples 
behaviour  at  different  environments,  as  studied  in  environmental  psychology 
and  GIScience,  but  also  bring  benefits  to  other  disciplines  and  application  fields, 
such  as  Information  and  Communication  Technology,  urban  planning,  architec- 
ture and  policy  making.  In  the  following,  we  illustrate  two  different  case  studies. 

The  first  case  study  used  the  collected  affective  data  (see  Section  3.1)  to  study 
the  impact  of  environmental  characteristics  on  people’s  affective  responses.  We 
focused  on  the  area  surrounding  the  Vienna  University  of  Technology  (Aus- 
tria), mainly  due  to  the  diverse  environmental  settings  within  this  area.  We 
subdivided  the  area  into  three  distinctive  urban  scenes  according  to  their  level 
of  traffic  and  vegetation:  A)  green  urban  area  (urban-green),  B)  urban  area  with 
light  or  no  traffic  (pedestrian  lanes  and  one-lane  street,  urban-light  traffic)  and 
C)  urban  area  with  heavy  traffic  (roads  ranging  from  two  to  three-lanes,  urban- 
heavy  traffic).  These  three  urban  scenes  are  compared  according  to  the  level  of 
comfort  ratings  reported  by  the  participants  (Figure  5).  The  results  suggest  that 
the  level  of  comfort  ratings  significantly  differ  between  the  three  environmental 
settings  (H(2)=  103.4,  P < 0.001).  For  more  details  on  the  data  analysis,  refer  to 
Klettner  et  al.  (2013). 

As  can  be  seen  from  Figure  5,  urban  green  areas  show  the  most  positive  rat- 
ings among  the  three  urban  scenes,  followed  by  areas  of  urban-light  traffic. 
Urban  areas  with  heavy  traffic,  on  the  other  hand,  show  highly  negative  ratings. 
However,  we  argue  that  in  order  to  draw  a clearer  conclusion,  more  research 
should  be  done  on  this  aspect,  e.g.  a further  classification  of  the  study  area, 
consideration  of  other  contextual  factors  (such  as  time),  validation  of  the  qual- 
ity of  the  affective  data  collected,  and  investigation  of  the  bias  caused  by  the 
participation  inequality’  (see  Section  5 for  more  discussions). 

Our  second  case  study  focused  on  how  the  collected  affective  data  can  be  used 
to  design  smart  human-centred  geospatial  applications,  particularly  navigation 
systems.  Navigation  systems  aim  to  provide  users  with  navigation  guidance 
when  visiting  unfamiliar  environments  (Gartner  et  al.  201 1).  Route  planning  is 
a key  component  in  navigation  systems,  and  aims  to  compute  an  optimal  route 


394  European  Handbook  of  Crowdsourced  Geographic  Information 


Using  mobile  crowdsourcing  and  geotagged  social  media  data  395 


from  an  origin  to  a destination.  Current  route  planning  algorithms  often  fail 
to  provide  other  routes  aside  from  shortest  routes  or  fastest  routes.  However, 
research  in  geography  and  environmental  psychology  has  shown  that  humans 
may  prefer  many  other  different  route  qualities  when  choosing  which  way  to 
take’,  such  as  safety  and  attractiveness  (Golledge  1995;  Zacharias  2001). 

We  proposed  that  incorporating  peoples  affective  responses  into  route  plan- 
ning can  help  to  provide  routes  with  other  qualities  and  characteristics.  The 
basic  idea  is  to  aggregate  affective  ratings  of  similar  users  to  model/approxi- 
mate the  current  user’s  perception  of  different  street  segments.  With  this,  a 
street  network,  in  which  each  segment  is  encoded  with  a collective  rating,  can 
be  generated.  Based  on  this  kind  of  street  networks,  we  can  compute  routes 
with  different  qualities  and  characteristics,  such  as  the  most  comfortable  route 
(i.e.  route  avoiding  uncomfortable  areas)  and  the  safest  route.  For  more  details 
on  the  algorithm,  refer  to  Huang  et  al.  (2014). 

Figure  6 shows  an  example  of  the  route  computed  by  considering  the  ‘level 
of  comfort’  ratings,  comparing  to  its  shortest  counterpart.  An  online  empirical 
study  with  64  human  subjects  was  designed  to  evaluate  the  proposed  routing 


Figure  6:  The  most  comfortable  route  (green  one,  478  meters)  computed  by 
considering  people’s  affective  responses  (solely  the  ‘level  of  comfort’  ratings), 
comparing  to  its  shortest  counterpart  (grey  one,  442  meters). 
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algorithm  using  a video  task  (participants  were  asked  to  choose  the  walk  they 
prefer  to  take  after  watching  the  videos,  which  were  taped  along  the  Affec- 
tRoute  and  the  shortest  route)  and  two  map  tasks  (given  a start  and  an  end, 
participants  were  asked  to  draw  their  preferred  route).  In  the  map  task,  a chi- 
square  goodness-of-fit  test  showed  that  the  routes  computed  by  considering 
affective  responses  were  significantly  preferred  over  the  shortest  routes  [P  = 
0.041].  Similar  significant  results  were  found  in  the  map  tasks.  In  conclusion, 
considering  peoples  affective  responses  to  environments  leads  to  more  satisfy- 
ing routing  results. 

In  the  current  research,  we  aggregated  affective  ratings  from  a large  number 
of  other  users  to  provide  route  recommendations  for  the  current  user.  How- 
ever, it  is  still  unclear  how  the  routing  results  are  influenced  by  the  size/density 
of  the  affective  ratings.  More  studies  should  be  done  on  this  aspect.  On  the 
other  hand,  the  current  experiment  was  implemented  via  an  online  question- 
naire. Further  research  should  be  done  to  explore  more  realistic  evaluation  in 
more  diverse  environments. 


Challenges  and  Lesson  Learned 

While  mobile  crowdsourcing  and  social  media  data  are  promising  for  study- 
ing peoples  affective  responses  to  environments,  and  these  affective  responses 
enable  many  innovative  applications,  many  challenges  also  exist.  In  the  follow- 
ing, we  highlight  some  of  the  essential  challenges,  and  describe  the  lesson  we 
learned  from  our  current  research,  and  discuss  some  hints  for  addressing  these 
challenges. 

Motivating  participation.  Motivating  people  to  participate  is  key  to  crowd- 
sourcing projects.  This  is  even  more  important  for  crowdsourcing  peoples 
affective  responses,  which  are  subjective.  Without  enough  contributions,  it  is 
impossible  to  get  a comprehensive  understanding  about  how  environments 
are  perceived  by  different  groups  of  people.  From  our  research,  we  learned 
that  in  order  to  motivate  peoples  participation,  it  is  essential  to  provide  real 
or  perceived  ‘benefits’  for  the  contributors,  and  as  well  as  to  provide  simple 
and  intuitive  (‘easy-to-use’)  interface/tool  to  facilitate  people’s  participation  and 
contribution. 

Data  quality.  Data  quality  is  another  big  challenge  when  using  mobile 
crowdsourcing  and  social  media  data  to  study  people’s  affective  responses.  It 
is  influenced  by  the  ‘participation  inequality’  of  the  contributors.  Some  of  the 
examples  of  the  aspect  are:  contributors  might  not  be  representative  samples 
of  the  population;  contributors  might  prefer  to  contribute  at  some  places  or 
some  occasions/time,  and  avoid  some  other  places  or  occasions/time.  All  these 
factors  might  strongly  influence  the  quality  of  the  data  collected.  Furthermore, 
compared  to  other  data  (such  as  those  crowdsourced  in  OpenStreetMap), 
people’s  affective  responses  are  subjective  in  nature,  and  no  reference  data  are 
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available  for  cross-checking.  How  to  deal  with  the  data  quality  of  the  crowd- 
sourced  affective  data  is  still  an  open  research  question.  A promising  solution  is 
to  combine  self-reports,  social  media  data  analysis  and  technical  sensors  (such 
as  Galvanic  skin  response  GSR,  and  Electrocardiogram  ECG)  when  studying 
peoples  affective  responses.  With  this,  we  can  cross-check  the  data  we  collect. 

Privacy.  Peoples  affective  responses  might  contain  lots  of  sensitive  personal 
information.  Therefore,  when  studying  peoples  affective  responses,  especially 
at  an  individual  level,  it  is  important  to  develop  methods  to  address  the  privacy 
concern.  Anonymization  technique,  which  was  often  employed  in  traditional 
empirical  studies,  might  not  work  well  for  crowdsourcing  projects  and  social 
media  data  analysis.  For  example,  if  we  know  an  anonymous  user  often  con- 
tributes at  a particular  place  in  the  early  morning  (e.g.  at  6:00)  and  at  another 
place  in  the  afternoon  (e.g.  at  14:00),  we  are  potentially  able  to  re-identify  who 
the  user  is,  as  these  two  places  might  correspond  to  his/her  home  and  office 
(Huang  et  al.  2013).  More  research  should  be  done  to  develop  privacy-preserv- 
ing techniques  for  mobile  crowdsourcing  and  social  media  data  analysis. 


Summary  and  Outlook 

The  literature  has  shown  that  humans  perceive  and  evaluate  environments  affec- 
tively, and  these  affective  responses  to  environments  influence  our  daily  behav- 
iour and  decision-making  in  space.  This  paper  presented  our  recent  efforts 
towards  these  aspects.  We  illustrated  two  approaches  to  collect  and  acquire  affec- 
tive responses  from  a large  number  of  people.  Particularly,  mobile  crowdsourc- 
ing and  social  media  data  analysis  were  introduced.  This  paper  also  presented 
two  different  case  studies  to  illustrate  the  potential  applications  of  affective  data. 

Currently,  we  are  exploring  hybrid  methods  to  study  peoples  affective 
responses  to  environments,  combining  mobile  crowdsourcing  (self-reports), 
social  media  data  analysis,  and  mobile  sensing  with  sensors  like  Galvanic  Skin 
Response.  The  privacy  concerns  of  these  affective  contributions  will  also  be 
addressed.  We  are  also  applying  the  affective  data  to  other  disciplines,  such  as 
urban  planning  and  transportation. 
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Abstract 

This  contribution  concerns  ongoing  research  by  the  authors  on  the  integrated 
use  of  Social  Media  Geographic  Information  (SMGI)  and  Authoritative  Geo- 
graphic Information  (A-GI)  as  a support  in  urban  and  regional  planning. 
Advances  in  Information  and  Communication  Technologies  (ICT)  are  foster- 
ing the  production  and  the  sharing  of  georeferenced  user-generated  contents, 
namely  Volunteered  Geographic  Information  (VGI)  and  SMGI,  which  may 
complement  traditional  spatial  data  sources.  VGI  is  a voluntary  contribution 
by  users  in  order  to  collect  or  to  disseminate  geographic  knowledge,  while 
SMGI  may  be  considered  a deviation  from  VGI  nature,  due  to  the  implicit 
and  passive  mode  in  disseminating  geographic  information,  which  is  exclu- 
sively one  embedded  attribute  of  the  main  shared  information.  However,  SMGI 
may  offer  unprecedented  opportunities  to  investigate  users  needs,  opinions, 
behaviors  and  movements,  thus  representing  a potential  support  for  analysis 
and  decision-making  in  spatial  planning.  In  this  respect,  the  authors  present  an 
original  tool  called  Spatext,  which  allows  collection,  management  and  analy- 
sis of  SMGI  in  GIS  environment,  easing  the  integration  of  SMGI  with  official 
information.  Afterwards,  the  opportunities  for  spatial  planning  arising  from 
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SMGI  are  demonstrated  through  a case  study  where  this  type  of  information 
is  used  for  investigating  the  geography  of  places.  An  original  methodology  is 
developed  applying  clustering  techniques  on  the  spatial  and  the  temporal  com- 
ponents of  SMGI  collected  from  Instagram.  The  applied  methodology  enables 
the  identification  of  residential  buildings  that  are  not  mapped  in  available  offi- 
cial datasets.  The  results  demonstrate  how  SMGI  may  be  proficiently  used  to 
integrate  and  update  A-GI,  as  well  as  to  investigate  the  users  behaviors  and 
movements  in  an  urban  environment. 


Keywords 
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Introduction 

In  recent  years,  continuous  advances  in  Information  and  Communication 
Technologies  (ICT),  the  internet  and  Web  2.0  technologies  are  strengthening 
the  production,  the  sharing  and  the  access  of  user-generated  contents  among 
millions  of  users  worldwide.  The  availability  of  user-generated  contents  may 
represent  a novel  source  of  geographic  information  (Elwood,  Goodchild  & 
Sui  2012),  inasmuch  as  most  of  these  contents  may  embed  a spatial  reference, 
thanks  to  the  availability  of  GPS  and  sensors  in  handheld  devices  and  smart- 
phones, as  well  as  geo -browsers  or  location-based  social  networks,  which  are 
used  for  production. 

This  novel  type  of  geographic  information  is  commonly  referred  to  as  Vol- 
unteered Geographic  Information,  emphasizing  the  role  of  users  which  act  as 
volunteer  sensors  to  collect  and  contribute  to  this  data  (Goodchild  2007).  How- 
ever, the  information  produced  and  shared  through  social  networks,  namely 
Social  Media  Geographic  Information  (SMGI)  (Campagna  2014),  may  be  con- 
sidered a deviation  from  the  traditional  VGI  nature,  since  the  collaborative  col- 
lection and  the  diffusion  of  geographic  information  are  not  the  main  purposes 
of  users  (Stefanidis,  Crooks  & Radzikowski  2013).  Despite  an  implicit  nature 
of  SMGI  for  the  geographic  dissemination,  this  kind  of  information,  coupled 
with  traditional  VGI,  proved  to  be  useful  in  different  application  domains  such 
as  environmental  monitoring,  crisis  management  (Poser  & Dransch  2010),  as 
well  as  urban  planning  (Frias-Martinez  et  al.  2012).  Indeed,  VGI  and  SMGI 
may  represent  a valuable  complement  to  traditional  official  information,  sup- 
plying insights  on  users’  perceptions  and  needs,  opinions  on  places,  as  well  as 
information  about  daily  events,  in  (near)  real-time,  so  potentially  contributing 
to  faster  decisions. 

In  the  urban  and  regional  planning  domain,  VGI  and  SMGI  may  play  an 
important  role  to  support  (1)  analysis,  (2)  design  and  (3)  decision-making, 
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fostering  innovations  in  planning  methodologies,  inasmuch  the  majority  of 
information  required  and  used  in  practices  is  mainly  spatial.  As  a matter  of  fact, 
this  innovative  wealth  of  GI  may  integrate  the  current  availability  of  official 
digital  information  with  pluralist  knowledge  from  local  communities  that  is 
usually  neglected  in  practice,  paving  the  way  for  innovative  analytic  scenarios. 

Currently  in  Europe,  a wealth  of  official  digital  geographic  information  was 
made  available  since  2007  by  the  implementation  of  the  Directive  2007/02/CE 
(INSPIRE),  which  fostered  developments  in  Spatial  Data  Infrastructures  (SDI) 
among  European  member  states.  This  process  is  favoring  the  access  and  the  reuse 
of  available  official  digital  information,  namely  A-GI,  produced  by  Public  Author- 
ities. This  way,  planners,  analysts  and  professionals  may  access  A-GI  according 
to  common  technology,  data  formats,  and  policy  standards.  The  integration  of 
available  official  information  with  SMGI  may  further  improve  this  potential, 
enriching  the  official  datasets  with  information  regarding  not  only  geographic 
facts  but  also  insights  on  peoples  perceptions  and  feelings  both  in  space  and  time. 

Nevertheless,  the  opportunities  for  the  use  of  SMGI  in  spatial  planning  as 
affordable  and  potentially  boundless  sources  of  information  have  to  deal  with 
diverse  challenges  related  to  data  management,  data  quality  and  data  analysis. 
Indeed,  the  traditional  spatial  analysis  methodologies  and  techniques  may  not 
be  fully  suitable  to  tackle  the  complexity  of  this  information  that  exhibits  Big 
Data  nature  for  its  modes  of  production  and  consumption  (Caverlee  2010). 
Hence,  in  spite  of  several  applications,  related  to  different  application  domains 
and  built  upon  the  integration  of  SMGI  and  VGI  in  recent  years,  the  lack  of 
common  methods  to  deal  with  these  issues  still  requires  the  development  of 
novel  analytical  frameworks  in  order  to  fully  exploit  the  SMGI  potential  for 
analysis,  design  and  decision  making. 

In  the  light  of  the  above  premises,  this  contribution  presents  a review  of  the 
authors  research  results  on  the  integration  and  use  of  A-GI  and  SMGI,  aim- 
ing at  developing  valuable  tools  and  methodologies  for  spatial  planning.  The 
remainder  of  the  contribution  is  articulated  as  follows.  The  next  section  briefly 
discusses  the  distinctive  features  of  SMGI,  focusing  on  its  main  issues  and 
opportunities  for  analysis.  Section  3 introduces  an  original  tool,  developed  by 
the  authors  and  called  Spatext,  which  enables  the  seamless  collection,  manage- 
ment and  analysis  of  SMGI  from  multiple  social  media  in  a GIS  environment. 
In  section  4 a novel  approach  to  SMGI  analytics  is  proposed  concerning  a case 
study  related  to  urban  planning.  Finally,  section  5 draws  conclusions  and  sum- 
marizes the  discussion  about  the  opportunities  and  the  open  issues  of  SMGI  for 
urban  and  regional  planning. 


Issues  and  opportunities  of  SMGI 

The  wealth  of  georeferenced  user-generated  contents  regarding  facts,  opinions, 
and  concerns  of  users,  freely  accessible  through  the  internet  by  social  media 
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Application  Programming  Interfaces  (APIs),  may  affect  current  practices  in 
urban  and  regional  planning,  giving  opportunities  for  real-time  monitoring  of 
needs,  thoughts  and  trends  of  local  communities.  However,  the  current  public 
accessibility  to  SMGI  is  rather  limited  (Lazer  et  al.  2009),  and  common  meth- 
ods to  manage,  process  and  exploit  these  resources  in  practices  still  lack.  The 
main  hurdles  limiting  a wider  use  of  SMGI  may  be  found  both  in  the  shortage 
of  user-friendly  tools  to  collect  and  to  manage  huge  data  volumes  and  in  the 
particular  data  structure  of  this  information,  which  is  burdensome  to  analyze 
by  traditional  methods.  While  the  former  challenge  is  starting  to  be  addressed 
by  new  approaches  typical  of  computational  social  science,  an  emerging  field 
that  aims  to  develop  methodologies  to  address  the  complexity  of  Big  Data 
(Lazer  et  al.  ibidem),  the  latter  challenge  might  require  a tuning  of  analytical 
methodologies  to  deal  with  the  several  facets  of  SMGI. 

First  of  all,  although  SMGI  may  be  potentially  available  through  the  internet 
from  any  social  media  APIs,  each  platform  features  specific  characteristics  for 
contents  production  and  sharing;  hence  SMGI  from  different  social  media  may 
embed  different  attributes,  causing  difficulties  for  data  integration  and  analysis. 
Moreover,  SMGI  is  usually  broadcasted  through  the  internet  by  coupling  alpha- 
numeric data  and  multimedia  clips,  which  complicate  the  analysis  by  means  of 
traditional  query  languages.  Secondly,  SMGI,  as  user-generated  contents  with 
an  associated  geospatial  component,  combines  the  spatial  and  the  temporal 
dimension  of  geographic  information  with  a third  dimension,  namely  the  user 
itself,  therefore  extending  the  range  of  available  analytical  methods  with  fur- 
ther opportunities,  such  as  users’  behavioral  analysis,  users’  interests  investiga- 
tion, land  segmentation  and  potentially  any  analysis  based  on  space,  time  and 
user  (Campagna  ibidem).  These  analytical  methods  may  represent  a novel  way 
to  investigate  facets  of  the  social  and  cultural  habits  of  local  communities,  but 
their  implementation  may  represent  a challenge,  which  requires  the  integration 
of  traditional  spatial  analysis  methods  with  expertise  and  contributions  from 
various  disciplines  such  as  social  sciences,  linguistic,  psychology  and  computer 
science  (Stefanidis,  Crooks  & Radzikowski  ibidem). 

The  requirements  for  new  analytical  tools  to  deal  with  SMGI,  and  the  oppor- 
tunities resulting  from  the  inherent  nature  of  this  information,  guide  the  devel- 
opment of  an  original  user-friendly  tool,  called  Spatext,  which  eases  the  collec- 
tion of  information  from  multiple  social  networks  and  the  integration  of  the 
data  in  a GIS  environment  for  analysis. 


Spatext:  the  SPAtial-TEmporal-teXTual  Suite 

Spatext  is  an  add-in  for  the  commercial  software  ESRI  ArcMap©  implemented 
in  Python  2.7.  It  enables  the  contextual  social  media  data  collection,  man- 
agement, geo co ding,  as  well  as  the  spatial,  temporal  and  textual  analysis  of 
SMGI.  This  SMGI  Analytics  suite  includes  a number  of  tools,  which  can  be 
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used  mainly  to  (1)  retrieve  social  media  data  from  social  media  (including 
Twitter,  YouTube,  Wikimapia,  Instagram,  Instagram  Places,  Foursquare  and 
Panoramio);  (2)  geocode  or  georeference  data;  and  (3)  carry  out  integrated 
spatial,  temporal  and  textual  analyses.  In  addition,  the  number  of  analytical 
methods  available  in  the  tool  is  steadily  increasing  in  order  to  include  several 
clustering  algorithms  to  enable  user  profiling,  user  movement  analysis,  user 
behavioral  analysis  and  land  use  detection,  to  name  a few.  Indeed,  the  collec- 
tion, management  and  geocoding  functionalities  may  turn  any  social  media 
content  into  a workable  SMGI  dataset,  which  may  then  be  directly  integrated 
with  other  spatial  data  and  analyzed  in  a GIS  environment  with  off-the-shelf 
instruments. 

Spatext  takes  advantage  of  the  available  social  media  APIs  to  perform  que- 
ries directly  from  the  GIS  interface,  enabling  the  collection  of  multimedia 
information  regarding  different  topics,  time  periods  and  geographic  areas. 
This  way,  the  extension  of  traditional  GIS  tools  with  Spatext  tools  may  ease 
the  integration  of  SMGI  with  authoritative  data,  in  order  to  support  analy- 
sis, design  and  decision-making  in  urban  and  regional  planning.  The  tools 
included  in  Spatext  are  developed  in  order  to  deal  with  the  aforementioned 
issues  regarding  the  access,  management  and  analysis  of  ‘big  data  and  can 
be  categorized  in  three  different  classes  according  to  the  specific  function: 
(1)  data  collection,  (2)  data  management  and  (3)  data  analysis.  The  first  class 
includes  user-friendly  tools  that  enable  the  harvesting  of  information  from 
several  social  networks  through  spatial,  temporal  or  textual  queries.  These 
tools  can  facilitate  the  direct  access  to  social  networks  APIs  avoiding  program- 
ming efforts.  The  second  class  provides  tools  developed  to  ease  the  manage- 
ment, the  integration  and  the  successive  analysis  in  GIS  environment  of  SMGI 
extracted  from  different  sources.  These  tools  aim  to  limit  the  issues  regarding 
the  management  and  conversion  of  SMGI  originated  from  different  sources, 
which  may  present  different  data  structures  and  information.  Finally,  the 
third  class  contains  tools  designed  for  analyzing  the  spatial,  temporal  and  user 
dimensions  of  this  information,  as  well  as,  for  enabling  the  investigation  of 
embedded  textual  contents.  At  the  time  being,  the  Spatext  suite  is  not  available 
for  download  due  to  minor  technical  revisions  ongoing  on  APIs  access.  An 
overview  of  the  Spatext  functionalities  is  presented  in  Table  1,  where  the  main 
tools  are  classified  and  briefly  described  according  to  the  specific  class,  while 
the  Spatext  architecture  is  shown  in  Figure  1. 

In  the  next  section,  functionalities  of  the  Spatext  tool  for  SMGI  analytics  are 
demonstrated  through  a case  study  related  to  urban  planning  in  the  municipal- 
ity of  Iglesias  in  Sardinia,  Italy.  The  case  study  proposes  the  analysis  of  Insta- 
gram SMGI  coupled  with  A-GI  from  Sardinian  Spatial  Data  Infrastructures 
(i.e.  Sardegna  Geoportale  http://www.sardegnageoportale.it)  to  investigate  the 
geography  of  the  municipality  and  to  debate  the  potential  opportunities  emerg- 
ing from  the  integration  of  implicit  experiential  knowledge  with  official  infor- 
mation for  urban  planning  practices. 
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SPATEXT  TOOLS 

CLASS  1 - 'DATA  COLLECTION’ 

query  parameters 

space 

time 

keyword 

Instagram  extractor 

Extracts  Instagram  SMGI  to 
shapefile 

S 

S 

YouTube  extractor 

Extracts  YouTube  SMGI  to 
shapefile 

Y 

Y 

Instagram  Places 
extractor 

Extracts  Instagram  Places 
SMGI  to  shapefile 

V 

Twitter  extractor 

Extracts  Twitter  SMGI  to 
shapefile 

Y 

V 

WikiMapia  extractor 

Extracts  Wikimapia  SMGI 
to  shapefile 

Y 

Foursquare  extractor 

Extracts  Foursquare  SMGI 
to  shapefile 

V 

Panoramio  extractor 

Extracts  Panoramio  SMGI 
to  shapefile 

CLASS  2 - ‘ DATA  MANAGEMENT’ 

function  activation 

manual 

automatic 

Geocoding  tools 

Geocode  places/addresses 
from  attribute  strings 

Y 

Conversion  tools 

Convert  SMGI  attributes  to 
table 

V 

Decomposition  tools 

Slice  SMGI  dataset  in  multiple 
datasets 

V 

CLASS  3 - DATA  ANALYSIS’ 

Temporal  Reference 

Enriches  SMGI  dataset  by  adding  information  on 
temporal  periods  (season,  month,  day,  hour)  for  further 
investigation 

Temporal  Trend 

Analyzes  the  temporal  trend  of  SMGI  contribution  pro- 
viding a graphic  report 

Clustering  functions 
(DBSCAN) 

Detect  clusters  of  high  density  by  running  the  Density- 
Based  Spatial  Clustering  of  Applications  with  Noise 
algorithm  on  SMGI  dataset 

Textual  tag- cloud 

Performs  a simple  tag- cloud  analysis  on  textual  contents 
of  SMGI  dataset  to  detect  topic  of  interest 

Table  1:  Main  functionalities  available  in  Spatext  tool. 
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Figure  1:  Spatext  architecture  design. 

Instagram  SMGI  analytics:  an  application  in  urban  planning 

In  this  section,  an  application  of  SMGI  analytics  is  proposed  through  the  analysis 
of  Instagram  contents  in  the  urban  environment  of  the  Iglesias  municipality  in 
Sardinia,  Italy.  Nowadays,  Instagram  is  one  of  the  most  popular  online  social  net- 
works worldwide,  and  it  enables  users  to  take,  upload,  edit  and  share  photos  with 
other  members  of  the  service  through  the  platform  itself,  or  other  social  media 
such  as  Facebook,  Twitter,  Tumblr,  Foursquare  and  Flickr.  Approximately  20  per- 
cent of  the  internet  users  aged  16  to  64  have  an  account  on  the  service,  and  the 
trend  is  growing  over  last  years.  In  addition,  demographics  of  active  Instagram 
users  (GlobalWeblndex  2014)  show  a balanced  percentage  between  male  users 
(51%)  and  female  users  (49%),  with  a high  percentage  (41%)  of  users  aged  16  to 
24  that  prevail  over  users  aged  25  to  34  (35%),  35  to  44  (17%),  45  to  54  (6%)  and 
55  to  64  (2%).  Statistics  on  the  service  stress  also  how  a major  part  of  active  users 
(56%)  appear  to  be  into  the  middle  quartile  (33%)  or  top  quartile  (23%)  of  income. 
Among  the  features  offered  by  Instagram,  the  geotag  allows  users  to  embed  lati- 
tude and  longitude  of  the  place  with  the  taken  photos,  therefore  allowing  to  share 
the  contents  and  the  geographic  reference  through  the  internet  according  to  own 
privacy  settings.  This  capability  plays  a central  role  in  considering  Instagram  con- 
tents as  SMGI  and  permits  the  development  of  analysis  to  investigate  spatial  and 
temporal  patterns  within  any  geographic  area  where  the  service  is  available. 

The  case  study  concerning  the  Iglesias  municipality  (Italy)  took  advantage  of 
the  Instagram  SMGI  for  a twofold  purpose:  (1)  to  explore  the  geography  of  the 
place  through  spatial  and  temporal  patterns  of  the  contributions,  investigating 
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trends  and  areas  of  interest  within  the  municipality,  and  (2)  to  identify  and  clas- 
sify SMGI  clusters,  relying  on  the  inherent  spatial  and  temporal  components,  as 
well  as  by  means  of  the  integration  with  A-GI,  in  order  to  detect  potential  miss- 
ing buildings  in  official  datasets.  The  operational  application  of  SMGI  analytics 
on  the  case  study  of  Iglesias  municipality  was  carried  out  according  to  the  fol- 
lowing three  main  steps:  (1)  data  collection,  (2)  analysis  of  spatial  and  temporal 
components,  and  (3)  detection  and  classification  of  SMGI  clusters,  as  explained 
in  detail  in  the  remainder  of  the  contribution. 


Data  collection 

The  data  collection  of  SMGI  from  Instagram  was  conducted  through  the  Spa- 
text  Instagram  extractor  tool  by  setting  the  spatial  query  parameter  on  the 
municipality  of  Iglesias  and  the  temporal  query  parameter  on  a one  year  period 
(from  1 August  2013  to  1 August  2014).  The  extraction  resulted  in  the  col- 
lection of  a one  year  sample  of  approximately  14,000  geotagged  photos  from 
1.243  users  for  the  study  area.  The  tool  automatically  generated  a point  feature 
dataset,  georeferencing  each  photo  according  to  the  geographic  reference  (lati- 
tude and  longitude)  embedded  in  the  spatial  metadata  of  the  content,  namely 
the  geotag.  Commonly,  the  geotag  refers  the  GPS  position  of  camera  when  the 
photo  was  taken;  however,  issues  in  connectivity  may  lead  toward  the  lack  of 
this  information.  In  these  cases,  the  Instagram  service  sets  the  geographic  coor- 
dinates of  the  contents  using  the  users  position  during  the  upload.  In  addition 
to  the  geographic  coordinates,  the  dataset  includes  several  attributes,  such  as 
name  of  the  place,  if  set  by  the  user  during  upload,  user  name,  user  id,  user 
picture  URL,  media  URL,  date  of  creation,  number  of  comments,  number  of 
likes,  tags  and  captions.  These  attributes  are  made  available  for  any  Instagram 
content  if  the  user’s  privacy  settings  are  public,  offering  opportunities  for  the 
development  of  several  analysis  in  combination  with  other  spatial  data  lay- 
ers. Even  though  these  pieces  of  information  are  publicly  available,  data  were 
anonymized  for  privacy  issues  before  any  storage  or  processing  for  the  study. 

An  exploratory  analysis  of  the  SMGI  dataset  showed  a mean  value  of  11.22 
photo/user,  a modal  value  of  1.0  photo/user  and  a median  value  of  2.0  photo/ 
user.  Indeed,  the  39.82%  of  users  contributed  with  only  1 photo  per  year,  the 
32.74%  contributed  with  5 photo  or  more,  while  only  the  4.34%  of  users  posted 
50  photos  or  more.  Despite  a different  degree  of  participation  by  users,  the 
dataset  was  investigated  in  order  to  identify  potential  commonalities  among 
contributions  in  terms  of  areas  of  interest  and  urban  dynamics. 

Analysis  of  spatial  and  temporal  components 

After  the  data  collection,  the  spatial  and  temporal  components  of  the  SMGI 
dataset  were  investigated  directly  in  GIS  environment,  in  order  to  explore 
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potential  patterns  of  interest  in  the  area  and  local  community  dynamics.  At  this 
stage,  the  SMGI  dataset  was  integrated  with  several  official  datasets  from  the 
regional  spatial  data  infrastructure  of  Sardinia  related  to  the  Iglesias  municipal- 
ity such  as  settlements,  roads  network  and  buildings. 

A simple  investigation  of  the  dataset  spatial  distribution  showed  a high  con- 
centration of  the  placemarks  within  the  built  environment,  with  approximately 
the  89%  of  the  contents  taken  in  residential  or  commercial  and  service  areas. 
This  value  may  depict  the  users’  preference  to  employ  the  Instagram  service  in 
situations  strictly  related  to  their  daily  life  within  the  city  and  might  be  consid- 
ered a good  starting  point  to  investigate  the  dynamics  in  the  municipality.  The 
spatial  distribution  of  the  SMGI  dataset  is  shown  in  Figure  2. 

With  the  above  considerations  in  mind,  the  temporal  component  of  the  SMGI 
dataset  was  investigated  for  different  periods  by  searching  potential  peaks  of 
interest,  trends  and  dissimilarities  in  the  use  of  Instagram  by  the  users  in  Igle- 
sias. The  temporal  analysis  was  performed  investigating  seasons,  months,  days 
of  the  week  and  hours  of  the  day,  disclosing  interesting  patterns.  The  results  of 
temporal  analysis  showed  how  SMGI  was  increasingly  produced  and  shared  by 
users  during  the  spring  (30.9%)  and  summer  (33.3%)  in  opposition  to  winter 
(19.1%)  and  autumn  (16.7%);  and  this  phenomenon  was  also  evident  in  month 
distribution  where  July  presented  the  highest  percentage  of  produced  contents 
(13%)  and  November  the  lowest  one  (5%). 

The  analysis  of  daily  distribution  provided  more  balanced  results,  with  a 
slightly  higher  percentage  of  contents  produced  during  weekends  (Saturday  and 
Sunday).  Finally,  the  analysis  of  daily  hours  trend  allowed  identifying  two  main 
peaks  of  interest  for  both  workdays  (Monday  to  Friday)  and  weekends  (Saturday 
to  Sunday).  The  peaks  were  identified  during  the  periods  14:00-15:00  and  21:00- 
22:00  for  workdays,  and  the  periods  14:00-15:00  and  20:00-21:00  for  weekends, 
probably  identifying  meals  or  pause  times.  In  contrast,  the  period  05:00-06:00 
showed  the  lowest  percentage  of  produced  contents  both  for  the  workdays  and 
the  weekends.  In  spite  of  similar  temporal  peaks,  the  workdays  and  weekends 
trends  exposed  a few  differences,  which  might  be  considered  to  be  a descriptor 
of  the  typical  cultural  behaviors  of  inhabitants  or  a sort  of  cultural  footprint  of 
the  place.  This  assumption  may  be  corroborated  by  the  results  of  a similar  study 
conducted  on  Instagram  datasets  by  Silva  et  al.  (2013),  which  demonstrated  how 
workdays  and  weekends  temporal  patterns  were  similar  for  cities  of  the  same 
country,  but  showed  major  differences  among  cities  in  different  countries.  The 
results  of  temporal  analysis  for  the  different  periods  are  provided  in  Figure  3. 


Detection  and  classification  of  SMGI  clusters 

The  results  of  spatial  and  temporal  investigations  led  towards  the  development 
of  further  analysis  to  investigate  the  geography  and  the  urban  dynamics  of  the 
municipality.  Especially  the  major  density  of  SMGI  in  the  built  environment 
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Figure  2:  Spatial  distribution  of  Instagram  SMGI  in  Iglesias  municipality. 
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Figure  3:  Temporal  distribution  of  Instagram  SMGI  dataset:  (A)  season,  (B)  month,  (C)  day  of  the  week,  (D)  hours  trend  during 
weekends  and  workdays. 
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fostered  the  development  of  analytical  methods  to  identify,  classify  and  inter- 
pret the  users’  interest  toward  certain  specific  spaces.  For  this  purpose,  the 
Density-Based  Spatial  Clustering  of  Applications  with  Noise  algorithm  or 
DBSCAN  (Ester  et  al.  1996)  and  a slightly  modified  version  called  Feature- 
Based  DBSCAN  (FB-DBSCAN)  were  integrated  in  Spatext,  and  were  used  to 
compute  clusters  based  on  the  spatial  density  of  points. 

The  DBSCAN  algorithm  offers  major  advantages  with  respect  to  other  clus- 
tering algorithms;  firstly  it  is  not  necessary  to  know  a priori  the  number  of 
clusters,  which  also  may  differ  in  size  and  shape.  Secondly,  it  works  using  two 
parameters  exclusively:  the  epsilon  (eps)  that  is  the  maximum  threshold  dis- 
tance for  including  points  in  the  same  cluster,  and  the  minimum  number  of 
points  (min_pts)  that  is  required  to  define  a cluster.  In  the  study,  the  goal  of 
the  clustering  analysis  was  the  identification  of  the  places  that  attracted  the 
interest  of  the  local  community,  which  may  be  measured  in  terms  of  high 
density  of  contributions.  Nevertheless,  operatively  there  was  no  opportunity 
to  establish  the  preferable  value  of  eps  and  min_pts  before  the  computation, 
therefore  the  DBSCAN  tool  was  applied  iteratively  on  the  SMGI  dataset  for 
different  measures  of  the  parameters  in  order  to  evaluate  different  results  of  the 
clustering.  The  assessment  of  clustering  results  led  toward  the  identification  of 
the  following  values,  which  proved  to  be  the  most  suitable  for  the  purpose  of 
the  study:  eps  = 20  meters  and  min_pts  = 5.  Indeed,  this  eps  value,  or  thresh- 
old distance,  was  able  to  cover  the  dimension  of  a medium- sized  fabric,  while 
the  min_pts  value  was  set  to  5 as  a compromise  value  to  avoid  false  positive 
in  clusters  detection  and,  at  the  same  time,  to  prevent  the  dismissal  of  clusters 
with  a modest  participation  of  users.  The  results  of  clustering  analysis  with  the 
above  set  of  values  enabled  the  identification  of  290  clusters  within  the  urban 
area  of  Iglesias,  with  a major  concentration  near  the  city  center.  In  addition, 
two  large  clusters  with  an  area  greater  than  50,000  square  meters  emerged  from 
the  analysis,  identifying  the  areas  attracting  the  highest  interest  by  users  within 
the  urban  context.  These  areas  concerned  both  the  historic  centre  of  Iglesias 
and  several  service  and  public  space  areas.  A closer  look  to  the  clusters  showed 
that  the  top  cluster  contained  the  historic  Cathedral  of  Santa  Chiara,  the  main 
avenue  for  leisure  and  night  life  of  the  municipality,  two  of  the  main  squares  of 
Iglesias,  as  well  as  the  train  station  area.  At  the  same  time,  the  bottom  cluster 
contained  several  areas  related  to  medical  services,  leisure,  nightlife  as  well  as 
the  public  park  of  the  municipality. 

Along  the  same  vein,  the  FB-DBSCAN  tool  was  used  on  the  SMGI  data- 
set in  order  to  detect  the  places  of  major  interest  for  each  user.  In  fact,  the 
FB-DBSCAN  algorithm  processes  the  dataset  after  performing  a selection  for 
attribute  on  the  sample,  in  this  case  the  users.  This  way,  the  algorithm  computes 
clusters  by  processing  only  points  related  to  a specific  user  for  each  iteration, 
offering  opportunities  to  develop  more  specific  analysis  on  the  users’  behav- 
ior. The  analysis  through  FB-DBSCAN  with  the  parameters  eps  = 20  meters 
and  min_pts  = 5 identified  368  clusters  concerning  266  users.  In  this  case  the 
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number  of  identified  clusters  was  higher  than  the  one  in  the  previous  analysis, 
but  the  clusters  sizes  were  notably  smaller,  identifying  specific  places  or  fab- 
rics within  the  municipality.  The  results  of  the  clustering  analysis  performed  by 
DBSCAN  and  FB-DBSCAN  are  shown  in  Figure  4. 

Each  cluster  identified  through  the  FB-DBSCAN  tool  belonged  to  the  contri- 
butions of  a single  user,  and  could  be  considered  representative  of  a specific  use 
regarding  residence,  work  or  leisure  activities.  The  current  use  of  a cluster  may 
be  discovered  by  analyzing  several  parameters  related  to  spatial  and  temporal 
characteristics,  as  well  as  by  integrating  further  spatial  information.  The  aim  of 
the  study  was  the  identification  of  not  mapped  buildings  in  the  official  informa- 
tion; therefore  the  latest  official  buildings  dataset  from  the  Regional  SDI  was 
integrated.  This  official  dataset  was  selected  in  order  to  check  the  consistency 
of  the  clusters  location  with  the  urban  fabrics  and  to  ease  the  identification  of 
suitable  parameters  to  detect  residential  clusters.  As  a matter  of  fact,  the  clusters 
related  to  a specific  land  use,  in  this  case  residential  use,  may  expose  similar  pat- 
terns for  certain  characteristics  such  as  number  of  intersections  among  clusters, 
temporal  span  among  contributions,  number  of  contributions  and  density  of 
contributions,  to  name  few,  paving  the  way  to  the  identification  of  common  pat- 
terns for  classification.  In  the  study,  six  different  parameters  were  selected  with 
regards  to  the  cluster  itself  and  to  the  contributions,  as  described  in  Table  2. 

The  values  of  the  six  parameters  were  estimated  for  each  cluster,  and  several 
combinations  of  the  values  were  iteratively  evaluated  to  identify  exclusively  the 
residential  clusters.  The  following  set  of  values  resulted  as  the  most  suitable  to 
classify  a cluster  as  residential  in  the  study  area:  Cluster  Centroid  and  Contri- 
butions Centroid  had  to  be  1 (yes),  while  Number  of  Contributions  and  Time 
Span  Among  Photos  had  to  present  the  highest  values  among  clusters  of  the 
same  user,  or  the  values  had  to  be  higher  than  10  and  30,  respectively.  Finally, 
Cluster  Intersections  had  to  be  equal  or  lower  than  2,  while  Cluster  Density 
had  to  be  higher  than  4.  The  above  parameters  allowed  the  identification  of 
47  residential  clusters,  which  were  confirmed  by  an  overlay  analysis  with  satel- 
lite imagery  in  GIS  environment.  Furthermore,  the  used  parameters  avoided 
potential  biases  caused  by  temporary  phenomena  such  as  massive  tourists’ 
presence  or  extremely  popular  events  thanks  to  the  threshold  interval  set  for 
the  parameter  Time  Span  Among  Photos , which  considered  only  time  periods 
equal  or  higher  than  30  days  to  classify  a cluster  as  residential.  Afterwards,  the 
same  set  of  parameters  was  used  to  identify  potential  missing  buildings  in  the 
official  dataset  by  setting  to  0 (no)  the  values  of  Cluster  Centroid  and  Contribu- 
tions Centroid , while  leaving  unchanged  the  other  parameters  values.  Indeed, 
the  values  of  Number  of  Contributions,  Cluster  Intersections , Time  Span  Among 
Photos  and  Cluster  Density  were  considered  as  a sort  of  residential  parcels  foot- 
print among  clusters  and  were  used  for  the  investigation. 

The  analysis  identified  40  clusters,  which  were  then  visually  assessed  through 
satellite  imagery  to  confirm  the  presence  of  not  mapped  buildings  in  A-GI.  The 
visual  assessment  allowed  the  detection  of  9 not  mapped  buildings;  at  the  same 
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Figure  4:  Clustering  analysis  of  SMGI  dataset:  DBSCAN  results  (left),  FB-DBSCAN  results  (right). 
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Parameters 

Description 

Units  of  measure 

Cluster  Centroid 

The  overlap  of  the  cluster  s centroid  with 
an  official  building  footprint  is  estimated 

Boolean 

Contributions 

Centroid 

The  overlap  of  the  cluster  s contributions 
centroid  with  an  official  building  footprint 
is  estimated 

Boolean 

Number  of 
Contributions 

The  total  number  of  contributions 
contained  in  the  cluster  is  estimated 

Number  of 
contributions 

Cluster 

Intersections 

The  total  number  of  intersections  between 
the  cluster  s shape  with  other  clusters 

Number  of 

intersections 

Cluster  Density 

The  ratio  between  the  cluster  s area  and  the 
number  of  contained  contributions 

Square  meters 

Time  Span  Among 
Photos 

The  time  passed  between  the  first 
contribution  and  the  last  one  in  the  cluster 

Days 

Table  2:  Parameters  used  to  identify  residential  clusters. 


time  the  other  31  clusters  were  confirmed  as  residential  areas,  but  the  build- 
ings were  already  mapped  in  A-GI.  This  issue  can  be  explained  by  the  lack  of 
tolerance  during  the  estimation  of  Cluster  Centroid  and  Contributions  Centroid 
values  with  the  official  buildings  dataset.  An  example  of  the  analysis  results  is 
provided  in  Figure  5,  where  six  different  clusters  (i.e.  A,  B,  C,  D,  E and  F),  their 
barycenter,  the  existing  buildings  footprints  from  the  official  dataset,  the  main 
roads  network,  and  the  Instagram  SMGI  dataset  are  shown. 

In  this  example,  the  manual  investigation  through  the  Google  Maps  satellite 
image  enabled  the  detection  of  two  buildings  which  were  not  mapped  in  official 
dataset,  namely  cluster  B and  D.  At  the  same  time,  the  visual  assessment  con- 
firmed the  building  presence  in  cluster  A,  C,  E and  F. 

This  example  demonstrates  the  potentialities  of  Instagram  SMGI  to  elicit 
information  related  to  geography  of  places,  and  also  shows  how  this  informa- 
tion may  be  potentially  used  as  a support  for  the  update  and  the  integration  of 
official  datasets. 


Conclusion 

The  results  of  the  proposed  study  offer  an  overview  of  potential  uses  of  SMGI 
for  integrating  and  updating  the  available  official  information,  as  well  as  for 
obtaining  information  about  the  physical  geography  of  places  in  the  domain 
of  spatial  planning  analysis.  Currently,  the  wealth  of  information  enclosed  in 
SMGI  may  be  used  to  investigate  the  concerns  and  the  attentions  of  people 
toward  places  and  also  their  behaviors  and  movements  in  space  and  time. 
These  opportunities  arise  from  the  increasing  availability  of  SMGI  produced 
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Figure  5:  Results  of  clusters  investigation  in  Iglesias  municipality. 


through  several  social  networks,  which  may  be  considered  as  affordable  and 
potentially  boundless  sources  of  near  real-time  information  about  any  topic. 
Hence,  the  collection  of  SMGI  and  the  integration  with  official  dataset  may 
represent  a valid  support  for  analysis,  design  and  decision-making,  offering  a 
pluralist  perspective  from  local  communities  to  enhance  methodologies  and 
practices  in  urban  and  regional  planning. 

Nevertheless,  despite  the  several  opportunities  for  analysis,  it  is  important 
to  be  aware  that  the  SMGI  datasets  should  be  not  considered  representative  of 
the  whole  local  community.  The  social  network  services  are  used  differently  by 
diverse  segments  of  the  population,  that  are  the  users  of  the  service  itself,  and 
the  preferences  and  cultural  biases  of  these  groups  highly  affect  the  phenomena 
under  observation  in  SMGI,  raising  issues  about  the  data  representativeness. 
Furthermore,  as  for  the  subject  of  the  proposed  case  study,  namely  the  study  of 
geography  of  a place  by  Instagram,  the  social  platforms  used  to  collect  SMGI 
suffer  of  a different  degree  of  penetration  worldwide  according  to  users’  pref- 
erence, limiting  de  facto  the  analysis  opportunities  only  for  areas  where  the 
services  are  available.  In  the  future,  a wider  diffusion  may  occur  to  this  respect 
as  suggested  by  the  current  social  network  growth  trends,  but  definitely  for 
the  time  being  both  SMGI  and  A-GI  show  different  diffusion  rates  in  diverse 
regions  and  countries  worldwide.  Therefore,  different  analytical  approaches 
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based  on  several  platforms  may  be  required  in  order  to  investigate  the  local 
contexts  appropriately.  Much  more  research  is  needed  to  assess  the  full  poten- 
tial of  SMGI  and  several  issues  should  be  addressed  regarding  data  quality 
and  representativeness;  however,  current  results  disclose  challenging  research 
opportunities,  which  may  lead  to  advances  in  spatial  planning  methodologies 
and  practices,  as  well  as  in  other  domains. 
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Abstract 

Are  Land  Owners  able  to  participate  in  official  Cadastral  Surveys?  Can  the  offi- 
cial procedures  be  modified  in  order  for  crowdsourcing  techniques  to  be  incor- 
porated in  them?  How  many  different  stakeholders  can  actively  get  involved 
and  which  is  the  optimum  workflow  that  could  be  adopted?  This  chapter 
addresses  the  main  question  of  whether  the  spatial  and  attribute  data  that  is 
collected  by  volunteers  - land  owners  - can  be  used  in  official  Land  Adminis- 
tration Systems  (LAS)  and  explores  the  potential  introduction  of  crowdsourc- 
ing techniques  into  the  official  cadastral  surveys  as  a simplified  and  transparent 
procedure  with  the  aid  of  citizens. 

As  the  title  reveals,  this  chapter  aims  to  propose  a crowdsourcing  cadastral 
model  as  an  alternative  option  to  the  official  cadastral  procedures.  The  research 
investigates  a voluntary  model  and  documents  it  in  terms  of  participants  and 
structure  according  to  various  international  case  studies  that  were  analyzed  in 
depth  within  the  research  that  was  carried  out  by  Haklay  et  al.  (2014).  The  main 
lessons  that  were  collected  by  each  separate  crowdsourced  case  study  were  used 
as  a guideline  for  the  suggested  cadastral  model.  The  opportunities  and  the 
weaknesses  are  isolated  and  explored  in  terms  of  a cadastral  survey.  The  main 
advantage  of  the  adopted  case  studies  is  the  opportunity  for  a simulation  of 
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real  circumstances  in  cadastral  surveys  and  not  a simplistic  logic.  The  main 
innovation  is  focused  on  the  design  of  the  model  in  an  a priori  approach  with 
well-defined  and  already  tested  successful  case  studies. 

The  model  is  clarified  in  technical  and  societal  aspects  while  it  sheds  light 
on  the  proposed  process.  The  workflow,  the  stakeholders  and  the  adopted 
main  lessons  of  the  crowdsourced  case  studies  are  the  key  components  that  are 
explored  and  analyzed  in  this  chapter. 

Keywords 

Volunteered  Geographic  Information,  Crowdsourcing,  Cadastre,  Land  Admin- 
istration & Management,  Spatial  Data  Infrastructure. 

Introduction 

The  great  revolution  that  has  taken  place  in  Geographic  Information  Science 
(GIS)  has  led  to  dramatic  changes  in  the  source,  use  and  manipulation  of  spatial 
information  during  the  last  decade.  Seeger  (2008)  and  others  name  it  ‘the  digi- 
tal spatial  data  which  is  collected  and  edited  not  by  data  producers  but  by  citi- 
zens who  are  not  experts  but  willing  to  disseminate  their  spatial  knowledge  and 
observations  without  any  special  invitation.  Blaut  et  al.  (2003)  had  earlier  noted 
the  specific  predisposition  of  all  human  beings  to  map  by  underlining  that  all 
people  have  natural  mapping  abilities.  According  to  Kingsley  (2007),  civil  society 
shares  the  same  goals  and  has  created  a non-hierarchical  network  of  self-organ- 
ized individuals  who  participate  in  it.  In  fact  he  predicted  the  evolution  of  map- 
ping, the  involvement  of  amateurs  with  the  aid  of  web  tools  and  the  alteration  of 
role  distribution  between  mapping  agencies  and  users  (Budhathoki  et  al.  2008). 
Three  principle  definitions  were  introduced  to  give  a general  title  to  the  phe- 
nomenon; Neogeography , Volunteered  Geographic  Information  and  Crowd- 
sourcing. Turner  (2006)  refers  to  Neogeography  as  a set  of  techniques  and  tools 
that  fall  outside  the  realm  of  traditional  GIS.  Goodchild  (2007)  coined  the  term 
Volunteered  Geographic  Information  (VGI)  which  is  one  of  the  most  widely 
deployed  and  disseminated  terms  and  Sieber  (2007)  refers  to  Crowdsourcing 
in  the  way  that  it  is  currently  used.  All  terms  investigate  GI  and  find  innovative 
adjectives  to  shed  light  on  different  aspects  of  the  phenomenon,  which  can  be 
simply  understood  as  voluntary  manipulation  of  spatial  data  by  citizens. 

Within  the  new  web -based  era  and  the  great  possibilities  offered  to  users  via 
dynamic  maps  and  new  technologies  to  manipulate  data,  another  field  presents 
great  interest  due  to  its  socio-economic  perspectives.  Land  Administration  is 
a relatively  new  term,  which  was  introduced  in  the  1990s.  Theoretically,  Land 
Administration  is  defined  as  the  process  of  determining,  recording  and  dis- 
seminating information  about  the  tenure,  value  and  use  of  land  when  imple- 
menting land  management  policies  (UNECE  1996).  Many  academics  have 
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investigated  Land  Administration  and  have  contributed  to  various  aspects  of 
the  field.  Dale  and  McLaughlin  (1999),  Williamson  (2001)  and  Bogaerts  et  al. 
(2002)  are  just  a few  of  the  academics  contributing  to  this  field. 

Cadastre  is  a significant  part  of  Land  Administration  and  has  various  defi- 
nitions according  to  different  scientific  sources.  The  United  Nations  (in  1985) 
and  the  International  Federation  of  Surveyors  (FIG)  (in  1995)  gave  the  pre- 
dominant definitions.  The  former  underlines  that  ‘the  cadastre  is  a methodi- 
cally arranged  public  inventory  of  data  on  the  properties  within  a certain 
country  or  district  based  on  a survey  of  their  boundaries;  such  properties  are 
systematically  identified  by  means  of  some  separate  designation.  The  outlines 
of  the  property  and  the  parcel  identifier  are  normally  shown  on  large-scale 
maps’  (UN,  1985  In:  Steudler,  2004:  13).  The  latter  notes,  ‘A  cadastre  is  nor- 
mally a parcel  based,  and  up-to-date  land  information  system  containing  a 
record  of  interests  in  land  (e.g.  rights,  restrictions  and  responsibilities).  It  usu- 
ally includes  a geometric  description  of  land  parcels  linked  to  other  records 
describing  the  nature  of  the  interests,  the  ownership  or  control  of  those  inter- 
ests, and  often  the  value  of  the  parcel  and  its  improvements’”  (FIG  1995  In: 
Steudler,  2004:  14).  The  need  for  an  accurate  and  up-to-date  cadastral  system 
is  so  vital  that  Kaufmann  and  Steudler  (1998),  in  their  publication  ‘Cadastre 
2014’,  support  the  inclusion  of  public  and  traditional  law  aspects  in  the  defini- 
tion given  by  FIG  in  1995. 

Cadastral  survey  according  to  Steudler  (2004:  14)  is  ‘simply  defined  as  a sur- 
vey of  boundaries  of  land  units’.  Generally,  cadastre  is  an  essential  tool  for  land 
management  and  administration  as  it  records  the  land  parcels  which  constitute 
a part  of  a country’s  spatial  information  infrastructure.  The  Bogor  Declaration 
on  Cadastral  Reform  (UN-FIG  1996)  in  other  words  declared  that  the  develop- 
ment of  modern  cadastral  infrastructures  facilitate  efficient  land  and  property 
markets,  protect  the  land  rights  of  all,  and  support  long  term  sustainable  devel- 
opment and  land  management.  It  also  facilitates  the  planning  and  development 
of  national  cadastral  infrastructures  so  that  they  may  fully  service  the  escalat- 
ing needs  of  greatly  increased  urban  populations. 

The  main  question  that  is  posed  in  this  chapter,  taking  into  consideration  all 
the  above  perspectives  on  technology  and  potential  scenarios,  is  whether  VGI 
and  crowdsourcing  techniques  more  generally  can  be  incorporated  in  cadastral 
surveys  and  to  what  extent.  The  testing  of  the  results  is  done  through  the  Hel- 
lenic Cadastre  project,  which  is  a well-known,  long-lasting  project.  The  idea  of 
incorporating  VGI  into  the  cadastral  procedure  is  based  on  the  power  of  local- 
ity and  the  participation  of  citizens  in  land  planning  as  active  parts  of  society. 


The  official  cadastral  procedure 

The  HC  Project  started  in  1995  and  cadastral  surveys  have  been  carried  out 
in  340  regions  all  over  the  country  while  106  cadastral  offices  have  already 
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begun  operations  in  these  regions.  The  responsible  agency  for  the  HC  project 
is  the  National  Cadastre  and  Mapping  Agency  S.A.  (NCMA  S.A.).  The  Hellenic 
Cadastre  is  a uniform,  public,  systematic  and  on-going  title  registration  infor- 
mation system  in  fully  digital  form  with  spatial  and  attribute  records  of  each 
land  parcel.  Before  investigating  on  the  alternative  crowdsourced  proposal,  an 
overview  of  the  official  procedure  is  thoroughly  presented. 

The  main  cadastral  survey  includes  the  following  stages  (Basiouka  & Potsiou 
2012)  (Figure  1): 

• Declarations  are  submitted  to  the  cadastral  survey  offices  by  the  right  hold- 
ers and  the  registration  of  the  declared  rights  is  added  to  a digital  database. 

• Interim  cadastral  tables  and  diagrams  are  formed  based  on  the  data  from 
the  submitted  declarations,  which  has  been  processed  by  lawyers  and 
surveyors. 

• Interim  cadastral  maps  and  data  are  published  for  a two-month  period  and 
extracts  are  sent  to  the  rights  holders  for  their  information  and  acceptance. 


Submission  of  property 
declarations 


Formation  of  interim 
cadastral  tables  and 
diagrams 


Suspension  of  the  interim 
cadastral  data 


Submission  of  objections 


Reformation  of  cadastral 
data 


Commencement  of  the 
Cadastral  Office’s  operation 
in  the  particular  area 


Figure  1:  Main  Cadastral  Stages. 
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• Objections  or  applications  for  correction  of  a cadastral  registration  are  sub- 
mitted and  forwarded  to  independent  administrative  committees,  depend- 
ing on  the  case,  by  whoever  has  a legal  right. 

• The  cadastral  data  is  reformed  after  examination  of  the  objections  and  the 
correction  claims  and  the  final  cadastral  tables  and  diagrams  are  formed. 
These  registrations  are  called  Initial  Registrations  and  they  constitute  the 
first  registration  in  the  Hellenic  Cadastre. 

• The  cadastral  office  in  operation  in  the  particular  area  replaces  the  old  Land 
Registry  Office  (NCMA  2011). 

Given  that  the  official  process,  according  to  Greek  legislation,  is  based  on  a core 
scientific  number  of  stages  where  surveyors,  lawyers  and  IT  technicians  are 
the  ones  to  collect,  edit,  store  and  handle  data,  the  proposed  idea  has  proven 
somewhat  innovative.  To  date,  citizens  are  excluded  at  all  stages,  except  from 
the  declaration  of  their  private  property  at  the  beginning  of  the  process. 


The  Proposed  Crowdsourcing  Cadastral  Model 

The  proposed  crowdsourced  model  for  cadastral  surveys  is  the  result  of  the 
main  lessons  that  have  been  collected  from  a series  of  successful  governmen- 
tal crowdsourced  projects  that  were  analyzed  in  depth  (Haklay  et  al.  2014);  the 
requirements  posed  by  the  nature  of  a cadastral  project;  the  innovative  ideas  the 
researcher  has  decided  to  introduce;  and  the  necessity  for  coordination  with  for- 
mal cadastral  procedures  and  realities.  As  previous  research  carried  out  by  Hak- 
lay et  al.  (2014)  indicated,  the  key  factors  for  a successful  crowdsourcing  experi- 
ment include  building  on  previous  experience,  leveraging  existing  technology 
and  having  the  support  of  key  partners  such  as  governments  or  other  authorities. 
The  proposed  model  aims  to  take  advantage  of  previous  experience  of  official 
procedures  and  to  introduce  innovative  techniques  that  will  facilitate  them. 

The  structure  of  the  proposed  model  is  divided  into  four  main  sections:  the 
adopted  main  lessons  identified  from  the  successful  case  studies,  which  will  be 
incorporated  in  various  parts  of  the  model:  the  proposed  workflow,  which  will 
replace  the  official  procedure;  and  finally  the  participants  and  stakeholders  of 
the  project.  The  aim  of  the  research  is  to  launch  a model  that  may  be  applied 
and  used  either  at  a local  or  a national  level;  and  it  may  have  a wider  application 
in  many  countries  and  communities  that  face  land  issues. 


The  successful  crowdsourced  case  studies 

The  research  on  the  successful  case  studies  was  carried  out  by  Haklay  et  al. 
(2014).  An  investigation  was  made  of  the  incentives,  scope  and  aims  behind  the 
practical  experiments;  their  participants  and  stakeholders;  their  relationships 
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and  the  modes  of  engagement.  The  research  also  shed  light  on  technical  aspects, 
successful  factors  and  the  problems  that  were  encountered  in  the  evaluation  of 
the  examples. 

Each  case  study  is  intended  to  provide  an  example  of  the  use  of  VGI  by 
government  or  by  the  public,  and  summarizes  the  context,  positive  and  nega- 
tive outcomes  and  main  lessons.  The  focus  of  6 case  studies  is  divided  into 
three  main  categories:  land  management,  public  administration  and  disaster 
response.  Although  the  case  studies  are  differentiated  by  content  and  scope, 
all  of  them  are  governmental  projects  that  incorporate  voluntary  and  crowd- 
sourced data  collection,  so  their  study  can  isolate  those  components  that  are 
crucial  for  the  success  of  the  current  project.  The  main  lessons  derived  from  the 
case  studies  and  the  outcomes  are  described  briefly  below,  with  text  taken  from 
Haklay  et  al  (2014,  CC-BY): 

Mapping  of  South  Sudan  This  was  launched  because  of  the  need  for  a tem- 
porally accurate  and  up-to-date  map  when  the  new  nation  was  created.  Google 
Map  Maker,  the  Sudanese  diaspora  and  various  organizations  carried  out  work- 
shops to  train  people  to  work  separately  on  the  digitization  of  aerial  imagery. 
A significant  amount  of  work  was  completed  in  a very  short  amount  of  time 
by  adopting  local  knowledge  and  providing  technical  tools.  Those  who  experi- 
enced the  training  sessions  were  inspired  to  transmit  the  experience  and  recruit 
new  volunteers. 

Main  lessons: 

• Crowdsourcing  projects  can  be  coordinated  and  implemented  from  a 
distance. 

• Great  participation  of  volunteers  and  transmission  of  motivation  to  others 
are  key  factors  in  terms  of  participation  in  crowdsourcing  applications. 

Informal  settlement  mapping,  MapKibera,  Nairobi,  Kenya.  Map  Kibera  was 
carried  out  in  the  most  crowded  slum  in  Nairobi,  Kenya,  in  an  effort  to  improve 
its  reputation  and  offer  an  accurate  picture  of  the  area,  which  is  quite  dynamic 
due  to  the  movement  of  the  population.  Local  people  collected  and  edited  GPS 
tracks.  Innovative  techniques  such  as  SMS,  video  and  voice  reporting  were  also 
launched  and  a small  amount  of  compensation  offered  to  participants. 

Main  lessons: 

• Mapping  can  be  achieved  by  young  local  people  relatively  quickly. 

• Basic  topographic  maps  can  be  enhanced  with  essential  thematic  layers. 

• A combination  of  open  source  and  conventional  software  can  facilitate  VGI 
projects. 

• Compensation  may  be  needed  to  improve  participation  in  locations  where 
participants  suffer  great  survival  issues. 

• Innovative  methods  such  as  SMS,  voice  and  video  reporting  can  support 
the  appeal  of  mapping  projects. 
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Crowdsourcing  The  National  Map,  National  Map  Corps,  US.  National  Map 
Corps  has  given  volunteers  the  choice  to  collect  and  edit  data  about  ten  dif- 
ferent human-made  structures  in  fifty  states  in  an  effort  to  provide  accurate 
and  authoritative  spatial  data.  The  methodology  includes  various  steps  such 
as  adding  new  features,  removing  obsolete  points  and  correcting  existing 
data.  A pilot  test  in  Colorado  showed  that  the  VGI  was  satisfactory  in  its 
accuracy. 

Main  lessons: 

• Adoption  of  challenging  techniques  such  as  gamification  has  been  success- 
ful and  attracts  volunteer  interest. 

• Evaluation  of  the  quality  indicated  that  the  participation  improves  accuracy 
and  reduces  the  errors. 

• Organizational  resistance  to  accepting  data  from  volunteers  is  one  of  the 
major  challenges  for  VGI  projects  of  this  kind. 

• Key  factors  to  successful  crowdsourcing  include  building  on  past  experi- 
ence, leveraging  existing  technology  and  having  the  support  of  key  indi- 
viduals within  the  organization. 

iCitizen,  mapping  service  delivery,  South  Africa.  This  project  is  at  the  design 
phase  and  aims  to  involve  the  public  at  a local  level  to  collect  data  points  via 
mobile  phones  and  adopt  different  ways  for  geotagging  of  photos  in  real  time  or 
via  SMS  and  email.  The  purpose  is  to  report  infrastructure  issues. 

Main  lessons: 

• Projects  can  be  used  for  a variety  of  tasks  at  local  level,  not  just  that  for 
which  they  were  designed. 

• Using  a range  of  software,  programming  languages  and  platforms  can 
broaden  a projects  horizons. 

• VGI  applications  face  financial  issues  due  to  their  nature  and  the  resources 
of  the  organizations  involved. 

• Concerns  from  agencies  about  the  quality  of  generated  data  sets  and 
improving  public  adoption  of  mobile  applications  are  common  challenges. 

Haiti  disaster  response.  One  of  the  most  well-known  crowdsourcing  applica- 
tions, developed  after  the  earthquake  hit  Haiti  in  2010.  Within  48  hours,  the 
capital  was  mapped  by  volunteers  who  contributed  from  every  part  of  the 
world  to  create  an  up-to-date  topographic  map  to  fill  the  gap  left  by  the  official 
mapping  agency.  The  maps  were  used  by  various  organizations  to  allocate  sup- 
plies and  medicine. 

Main  lessons: 

• An  integrated  methodology  of  this  kind  follows  four  steps:  spatial  data  con- 
tributed by  official  providers,  supplemented  with  GPS  tracks,  integrated 
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into  OpenStreetMap  and  updated  by  a great  number  of  volunteers  from 
each  part  of  the  world. 

• Time,  cost  and  official  trust  of  data  by  Non-Governmental  Organizations 
(NGOs)  and  other  official  partners  are  key  to  success. 

• Lack  of  coordination  and  experience  between  different  actors  can  lead  to 
duplication  of  data  and  waste  of  resource. 

• Differentiation  between  conventional  and  governmental  data  in  terms  of 
engagement  to  the  project  did  not  prevent  success. 

Community  Mapping  for  Exposure  in  Indonesia.  The  goal  of  this  project  is  to 
reduce  vulnerability  to  natural  disasters.  Young  people  have  successfully  col- 
lected spatial  and  attribute  data,  and  traced  them  in  the  OSM  platform  so  that 
thematic  maps  can  be  created  to  show  potential  damage  in  case  of  physical 
disasters. 

Main  lessons: 

• Interaction  between  official  providers  and  VGI  is  a parameter  of  success  not 
only  for  the  beginning  of  the  project  but  also  for  its  continuity. 

• Open  source  data  can  be  reliable  for  scenario  building  but  its  quality  can 
vary,  especially  in  terms  of  attribute  data. 

• The  coordination  of  participating  organizations  and  volunteers  is  impor- 
tant to  take  full  advantage  of  human  resources  and  technical  innovations. 


The  main  lessons  adopted 

Within  the  specific  alternative  proposed  model  for  cadastral  surveys,  a series 
of  main  lessons,  technical  aspects  and  successful  factors  are  adopted  as  vital 
for  the  viability  of  the  model.  The  majority  of  them  serves  the  nature  of  the 
research  and  plays  a crucial  role  in  its  design.  The  predominant  key  factors  are 
given  below  and  are  explained  in  terms  of  the  proposed  cadastral  model. 

• Training  and  workshops  should  take  place  at  the  beginning  of  each  appli- 
cation. Most  of  the  practical  applications  that  constitute  successful  crowd- 
sourced paradigms  - such  as  Community  Mapping  for  Exposure  in  Indo- 
nesia, MapKibera  in  Kenya  and  mapping  of  South  Sudan  - introduced 
workshops  as  a key  factor  for  their  success.  This  is  of  vital  importance  due 
to  the  technical  requirements  of  a cadastral  survey  and  the  quality  controls 
that  should  be  satisfied  at  the  end  of  any  project.  The  workshops  have  three 
different  targets:  (a)  to  inform  locals  about  the  necessity  and  benefits  of  the 
project,  (b)  to  recruit  them  and  (c)  to  train  them  to  properly  manipulate 
spatial  and  attribute  data. 

• Recruitment  of  volunteers.  Except  for  local  people  - land  owners  who 
will  play  a vital  role  in  data  collection  - the  proposed  model  is  based 
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on  non-governmental  organizations  and  undergraduate  students  of  the 
schools  of  Surveying,  Geography  or  any  other  relevant  field.  Both  will  keep 
the  cost  low  and  the  students  will  support  the  cadastral  surveys  both  prac- 
tically and  technically  The  undergraduate  students  will  participate  actively 
in  the  data  collection  and  editing.  The  idea  for  this  came  accidentally  due 
to  the  nature  of  the  study,  however,  Community  Mapping  for  Exposure  in 
Indonesia  had  already  recruited  undergraduate  students  to  collect  a huge 
amount  of  data  in  a relatively  short  time  and  scholarships  were  offered  in 
exchange.  The  idea  behind  the  recruitment  of  non-governmental  organiza- 
tions is  to  carry  out  all  workshops  at  the  beginning  of  the  process.  Students 
should  also  deal  with  the  great  amount  of  data  that  should  be  collected  and 
manipulated.  The  main  question  over  the  participation  of  land  owners  as 
volunteers  is  in  terms  of  their  motivations.  Land  owners  may  be  more  eas- 
ily recruited  if  cadastral  fees  were  eliminated,  or  similar  taxation  rates  low- 
ered, as  a result  of  their  participation.  Experience  from  other  countries  has 
indicated  that  the  incentives  that  lead  volunteers  to  participate  in  crowd- 
sourcing projects  are  a mixture  of  various  parameters.  The  land  owners  will 
participate  voluntarily  for  altruistic  reasons  if  they  recognize  the  necessity 
of  the  project,  how  their  lives  will  be  affected  by  its  implementation  and 
how  their  properties  will  be  protected.  This  is  especially  the  case  for  those 
who  face  difficulties  in  land  transactions,  use  and  development.  However, 
the  majority  will  be  motivated  if  a compensation  rate  is  offered  to  them. 

• Partnership  with  scientific  organizations.  Collaboration  between  the  vari- 
ous organizations  is  proposed  within  a well-defined  and  compact  pattern 
where  the  roles  and  duties  are  made  clear  from  the  beginning  of  the  pro- 
cedure. To  avoid  the  lack  of  coordination  and  duplication  of  data  noted  in 
Haiti  disaster  response,  each  stakeholder  should  be  responsible  for  a spe- 
cific part  of  the  cadastral  survey  and  all  the  participants  should  work  in 
cooperation  in  order  to  produce  the  final  result.  Supervision  and  quality 
controls  should  also  be  carried  out  by  experts.  The  main  innovation  of  the 
project,  which  has  a national  application,  is  based  on  the  participation  of 
various  NGOs  at  local  level.  The  NGOs  that  take  action  at  local  level  should 
be  responsible  and  should  participate  actively  in  the  cadastral  surveys  in 
these  areas. 

• Crowdsourcing  projects  can  be  coordinated  and  implemented  from  a dis- 
tance. The  importance  of  this  keynote  is  crucial  for  the  success  of  the  first 
phase  of  the  cadastral  survey  where  citizens  should  digitally  declare  their 
ownership  from  a distance  instead  of  hard  copy  declarations  and  indicat- 
ing their  ownership  on  digital  orthophotos  at  the  cadastral  offices.  The 
phase  may  be  carried  out  in  digital  form  and  implemented  from  a distance. 
The  attribute  data  declaration  may  be  replaced  by  an  online  declaration 
with  the  aid  of  Apps  or  online  tools  and  the  field  work  for  spatial  data  col- 
lection in  urban  areas  can  be  replaced  by  the  online  digitization  of  land 
parcels  on  orthophotos  provided  by  the  website  of  the  official  mapping 
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agency  or  on  OSM.  This  specific  strategy  worked  efficiently  in  the  mapping 
of  South  Sudan. 

• Open  - source  and  commercial  software  should  be  used  because  experi- 
ence has  indicated  that  their  combination  offers  the  required  freedom  and 
openness  for  the  project.  The  informal  settlement  mapping  in  Nairobi, 
Kenya,  flourished  with  the  aid  of  open-source  and  conventional  software. 
Open-source  software  may  be  used  in  data  collection  while  conventional 
software  may  be  used  in  data  manipulation.  The  contribution  of  volunteers 
in  data  collection  requires  the  use  of  flexible  and  easy-to-use  tools.  The  edit- 
ing of  data  demands  advanced  functionalities  that  may  be  available  only 
in  conventional  GIS  software  packages.  Previous  experience  has  indicated 
that  using  a range  of  software,  programming  languages  and  platforms  may 
broaden  a project’s  horizons. 

• Innovative  techniques.  In  an  effort  to  meet  the  requirements  of  the  pro- 
ject, especially  in  terms  of  quality,  a series  of  innovative  techniques  have 
been  adopted  in  data  collection  and  manipulation.  For  the  first  time, 
OSM  is  tested  for  use  in  data  manipulation  and  storage  in  cadastral  sur- 
veys. Also,  smartphones  and  handheld  GPS  gradually  replace  the  expen- 
sive equipment  that  is  needed  for  data  collection.  All  these  innovations 
were  introduced  to  propose  an  alternative,  viable  solution  to  the  official 
procedure. 


The  Workflow 

The  workflow  of  the  proposed  model  follows  a general  pattern  that  includes  all 
different  occasions  of  mapping  in  rural  or  urban  areas,  and  it  constitutes  of  four 
main  steps.  The  proposed  model  may  be  applied  at  the  beginning  of  a cadastral 
survey  and  it  may  replace  the  first  phase  of  official  cadastral  mapping.  The  hard 
copy  declaration  of  ownership,  the  identification  on  orthophotos  of  proper- 
ties in  the  rural  areas  and  the  acceptance  of  the  produced  result  are  only  a few 
stages  of  the  official  process  that  may  be  modified.  Although  the  procedure  may 
be  categorized  in  these  four  stages,  it  may  include  further  parameters  that  are 
identified  within  practical  experiments  (Figure  2). 

Training  of  volunteers  is  the  first  step  of  the  process.  The  NGOs  and  the 
undergraduate  students  will  train  the  land  owners  both  theoretically  and  prac- 
tically. The  award  for  the  participation  of  students  will  be  in  the  form  of  schol- 
arships and  work  experience. 

Data  collection  as  the  second  step  of  the  workflow  differs  depending  on  the 
nature  of  the  area  and  the  kind  of  data  that  should  be  collected.  Rural  areas 
require  a different  approach  in  comparison  to  urban  ones  while  spatial  data 
collection  requires  different  tools  to  attribute  data.  Data  in  rural  areas  may 
be  collected  by  using  handheld  GPS  devises,  tablets  or  smartphones  while  in 
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Figure  2:  The  workflow  of  the  proposed  Cadastral  Model. 


urban  areas  by  using  accurate  orthophotos  provided  via  the  website  of  the  offi- 
cial mapping  agency  or  various  online  maps  such  as  OSM  in  case  of  a lack  of 
other  accurate  basemaps. 

In  rural  areas,  the  citizens  may  either  declare  their  ownership  by  giving  the 
point  of  its  centroid  (Figure  3,  right)  in  a quick  and  inexpensive  approach  or 
collect  their  parcel  boundaries  (Figure  3,  left)  with  expert  support  after  been 
trained.  Both  approaches  fit  in  interim  cadastral  mapping.  The  methodology 
may  vary  depending  on  parcels  shape. 

In  urban  areas,  citizens  can  simply  declare  their  ownerships  by  using  online 
tools  and  orthophotos  as  basemaps  which  are  provided  via  the  website  of  the 
official  mapping  agency  or  online  dynamic  maps  such  as  OSM.  The  choice 
depends  on  the  availability  of  the  recourses.  The  experience  has  indicated  that 
OSM  can  be  adopted  for  cadastral  purposes,  taking  into  consideration  specific 
generalizations  and  rules  (Basiouka  et  al.  2015). 

The  attribute  data  collection  can  be  also  implemented  with  the  aid  of  online 
databases,  which  can  store  and  maintain  the  saved  information  so  that  the  hard 
copy  declaration  to  be  replaced. 

Data  evaluation,  as  the  third  step  of  the  workflow,  should  be  guaranteed  by 
the  national  mapping  agency,  which  will  also  be  responsible  for  the  mainte- 
nance and  storage  of  the  data  in  its  servers.  Thus,  one  of  the  most  important 
concerns  in  terms  of  viability  - the  difficulty  of  keeping  data  up  to  date  - can 
be  bypassed. 
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Figure  3:  Different  approaches  in  spatial  data  collection. 


The  Stakeholders  of  the  Model 

The  model  is  based  on  the  participation  of  citizens  in  a hybrid  approach  where 
citizens  and  property  owners  participate  as  volunteers  and  experts  work  as  team 
leaders  supervising  the  whole  process.  Thus,  the  hybrid  approach  involves  both 
amateurs  and  experts  for  its  implementation.  In  order  to  be  successful,  it  requires 
the  participation  of  locals,  experts  and  NGOs,  while  the  coordination  of  these 
parties  should  be  carried  out  by  the  official  mapping  agency.  Specific  parts  of  the 
public  sector  should  also  participate  in  certain  parts  of  the  procedure  (Figure  4). 

The  term  ‘locals  refers  to  the  land  owners  and  volunteers  who  will  participate 
in  the  cadastral  surveys.  Their  participation  should  be  based  on  data  collection 
and  editing  during  the  first  steps  of  the  process  and  the  acceptance  of  the  result 
at  the  end  of  the  cadastral  survey.  Citizens  may  collect  not  only  spatial  but  also 
attribute  data.  The  locals  constitute  the  base  of  the  pyramid  that  is  given  in 
Figure  4. 

The  next  level  of  the  pyramid  refers  to  the  NGOs  who  will  train  volunteers 
in  data  collection  and  editing.  Their  coordination  highlights  another  impor- 
tant parameter:  the  cooperation  of  participating  organizations  and  volunteers 
is  crucial  for  the  full  advantage  of  human  resources  and  technical  innovation. 
Their  interaction  is  also  a parameter  of  success  required  not  only  at  the  begin- 
ning of  the  project  but  throughout. 

Furthermore,  the  public  sector  should  act  to  supplement  these  actors  and 
provide  the  required  equipment  such  as  GPS  and  total  stations,  computers, 
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Figure  4:  The  stakeholders  of  the  Project. 


vehicles  and  facilities.  The  local  authorities  should  also  play  a critical  role  in 
data  collection  and  they  will  contribute  effectively  to  the  collaboration  among 
teams  of  local  people  at  the  stage  of  data  collection,  depending  on  their  knowl- 
edge. This  ensures  that  participants  can  contribute  to  specific  tasks  and  stages 
of  the  data  collection. 

The  national  mapping  agency  should  keep  a supervising  role  and  should  be 
actively  involved  in  all  stages  of  the  process.  Coordination  of  the  procedure, 
evaluation  of  the  result  and  maintenance  of  the  data  in  their  server  are  their 
predominant  tasks. 


Concluding  Remarks 

The  model  proposed  here  follows  the  trend  that  has  been  adopted  during  the 
last  decade  by  the  governmental  bodies  to  open,  inexpensive,  quick  and  trans- 
parent processes  in  Land  Administration  Systems  with  the  participation  of 
citizens  and  the  use  of  new  technologies.  The  proposed  model  constitutes  a 
general  process  that  may  be  applied  or  modified  according  to  the  special  needs 
of  the  official  mapping  agency  and  can  be  adopted  in  various  land  administra- 
tion issues  in  various  countries.  Its  main  innovation  is  based  on  the  combina- 
tion of  current  trends  and  previous  experience  in  similar  case  studies.  Its  main 
advantage  is  stated  not  only  through  the  participation  of  citizens  in  important 
decision  making  policies  but  also  through  its  flexibility.  There  are  still  many 
concerns  and  parameters  that  have  not  yet  been  identified,  quality  and  pri- 
vacy queries  that  a conservative  part  of  the  society  poses,  or  problems  of  cred- 
ibility that  make  governmental  bodies  reluctant  to  use  it.  However,  it  is  clear 
that  the  new  crowdsourcing  trend  is  not  temporary  and  will  continue  to  play 
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a fundamental  role  in  this  recently  begun  revolution.  The  general  outcome  is 
condensed  not  in  a unique  solution  but  in  a general  idea  that  should  be  flexible 
and  can  be  easily  modified  based  on  the  needs  of  society  and  the  available  tools 
and  funds. 
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Abstract 

Within  the  last  ten  years,  Volunteered  Geographic  Information  (VGI)  has 
developed  rapidly  and  significantly  influenced  the  world  of  GIScience.  Most 
prominently,  the  OpenStreetMap  (OSM)  project  maps  our  world  in  a detail 
never  seen  before  in  user-generated  maps  on  the  one  hand.  On  the  other  hand, 
most  of  the  urban  area  on  our  planet  has  been  covered  by  hundreds  of  mil- 
lions of  photos  uploaded  on  social  media  platforms,  such  as  Flickr,  Instagram, 
Mapillary,  Weibo  and  others.  These  data  are  directly  or  indirectly  geo-refer- 
enced and  can  be  used  to  extract  3D  information  and  model  our  world  in  the 
3D  environment.  At  the  current  stage,  several  approaches  have  been  made  avail- 
able to  visualise  and  generate  the  3D  world  mainly  using  OpenStreetMap  data. 
The  3D  buildings  are  unfortunately  restricted  to  a coarse  level  of  detail,  since  no 
further  information  about  facade  structure  is  available  on  OSM.  In  this  paper, 
the  current  work  of  reconstruction  of  3D  buildings  by  using  the  data  both  on 
OSM  and  Flickr  is  presented,  whereby  facade  structure  could  be  extracted 
from  Flickr  images  and  OSM  footprints  can  be  used  to  accelerate  the  process  of 
dense  image  matching  and  to  improve  the  accuracy  of  geo -referencing. 
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Introduction 

In  recent  years,  the  term  Volunteered  Geographic  Information  (VGI)  became 
popular,  whereat  VGI  describes  how  an  ever- expanding  range  of  users  collabo- 
ratively  collects  geographic  data  (Goodchild  2007a).  That  is,  hobbyists  create 
geographic  data  based  on  personal  measurements  (via  GPS  etc.)  and  share 
those  in  a Web  2.0  community,  resulting  in  a comprehensive  data  source  with 
humans  acting  as  remote  sensors  (Goodchild  2007b).  With  a global  cast  of  vol- 
unteers, OpenStreetMap  (OSM)  is  considered  as  one  of  the  most  successful  and 
popular  Volunteered  Geographic  Information  (VGI)  projects.  In  its  current 
state,  there  are  more  than  two  million  registered  members  (OSM  2015)  who 
contribute  to  the  rapid  growth  of  OSM.  Recent  investigations  on  its  complete- 
ness and  quality  have  shown  that  urban  areas  in  Central  Europe  in  particular 
have  already  been  mapped  with  an  impressive  level  of  detail  (Neis  et  al.  2012). 
In  those  areas,  OSM  is  well  ahead  of  only  mapping  the  street  network.  For  the 
continuous  improvement  of  OSM  it  is  crucial  to  enable  the  mapping  of  even 
more  detailed,  three-dimensional  spatial  information. 

Several  years  ago,  Over  et  al.  (2010)  investigated  the  possibility  of  creating 
a 3D  virtual  world  by  using  OSM  data  for  different  applications,  and  drew  the 
conclusion  that  OSM  has  huge  potential  for  fulfilling  the  requirements  of  Cit- 
yGML  LODI  (Groger  et  al.  2008),  which  is  modelled  as  block  models  regionally. 
With  the  rapid  development  of  OSM  in  recent  years,  especially,  sparked  by  the 
availability  of  high- resolution  imagery  from  Bing  since  2010,  there  has  been  an 
increase  in  building  information  in  OSM,  proving  that  volunteers  do  not  only 
contribute  roads  or  points  of  interest  (POIs)  to  the  database.  According  to  the 
latest  statistics  (the  values  are  derived  from  our  internal  OSM  database,  which 
is  updated  daily),  the  number  of  buildings  in  OSM  is  above  200  million,  thereof 
18.4  million  building  footprints  in  Germany.  The  research  of  Fan  et  al.  (2014) 
demonstrated  that  the  building  footprints  data  on  OSM  has  a high  degree 
completeness  and  semantic  accuracy.  There  is  an  offset  of  about  four  meters 
on  average  in  terms  of  position  accuracy.  With  respect  to  shape,  OSM  building 
footprints  have  high  similarity  to  those  in  authority  data.  Moreover,  there  is 
more  and  more  information  about  building  height  and  roof  structures,  which  is 
required  for  the  3D  reconstruction.  From  this  point  of  view,  one  can  say  that  it 
is  possible  to  model  the  virtual  world  in  3D  from  OSM  data.  The  3D  data  could 
be  further  enriched  when  introducing  related  information  from  other  VGI  pro- 
jects, such  as  Flickr,  WikiMapia,  Panoramio,  Instagram  and  Dronestagram. 

This  paper  provides  a detailed  perspective  on  the  generation  of  3D  city 
models  by  using  VGI  data  that  is  mainly  based  on  OSM  data.  First,  it  gives  an 
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overview  of  the  data  sources  that  could  be  used  for  the  3D  model  generation, 
then  the  mechanism  to  generate  3D  city  models  will  be  described.  Many  of  the 
proposed  algorithms  have  been  implemented  within  the  OSM-3D  and  Open- 
BuildingModels  projects,  which  will  also  be  introduced.  Finally,  this  paper  will 
give  conclusions,  a discussion  and  suggestions  for  future  work. 


VGI  as  a data  source  for  3D  reconstruction 

The  earliest  approach  for  sharing  3D  models  using  the  principle  of  every- 
one for  everyone  is  Google  3D  Warehouse  launched  on  April  24,  2006.  This 
shared  repository  contains  user-generated  3D  models  of  both  geo -referenced 
real-world  objects,  such  as  churches  or  stadiums,  and  non-geo -referenced  pro- 
totypical objects,  such  as  trees,  light  posts  or  interior  objects  like  furniture. 
The  former  also  appear  in  Google  Earth.  In  order  to  voluntarily  contribute, 
users  have  to  have  a certain  level  of  3D  modelling  skill.  The  main  focus  of  this 
repository  does  not  lie  in  assembling  3D  city  models  as  the  non-geo-referenced 
objects  seem  to  be  more  important  in  related  work.  They  are,  for  example,  used 
to  improve  methods  of  automatic  object  recognition  in  the  field  of  laser  scan 
classification  or  robotic  vision.  In  addition,  the  3D  warehouse  models  are  being 
integrated  in  several  commercial  systems,  such  as  design  tools  or  simulation 
software.  However,  the  number  of  3D  buildings  and  many  3D  city  facilities,  such 
as  bridges,  bus  stations  and  fuel  stations,  has  increased  in  recent  years,  thanks 
to  the  development  of  3D  modelling  computer  programs  such  as  SketchUp  and 
ESRI  CityEngine,  which  make  3D  editing  more  easy  and  effective. 

In  2007,  VGI  was  introduced  by  Goodchild  (2007a, b)  to  describe  the  recent 
revolution  of  collaboratively  created  spatial  information  on  Web  2.0.  Almost 
at  the  same  time,  Microsoft  Virtual  Earth  and  Google  Earth  launched  their 
pioneer  projects  in  the  way  of  VGI  or  crowdsourcing.  The  projects  are  called 
3D  VIA  (Virtual  Earth)  and  Building  Maker  (Google  Earth).  Both  of  them  pro- 
vide a model  kit  to  create  buildings,  deriving  the  3D  geometry  from  a set  of 
oblique  (and  proprietary)  birds-eye  images  of  the  same  object  from  different 
perspectives.  In  contrast  to  the  3D  Warehouse,  this  tool  specifically  aims  at  geo- 
referenced  3D  building  models  only.  It  is  intended  for  people  who  do  not  have 
knowledge  in  3D  modelling,  but  still  want  to  contribute. 

In  addition  to  3D  Warehouse,  there  are  several  free-to-use  3D  object  reposi- 
tories on  the  internet,  for  example  OpenSceneryX6,  Archive3D7  or  Shape- 
ways8.  These  projects  emerged  from  entirely  different  communities  with  inter- 
est in,  for  example,  flight  simulators  or  3D  printing.  The  contents  usually  lack 
connection  to  the  real  world  but  can  nonetheless  be  useful  to  enrich  real  3D 
city  model  visualisations. 

The  above  mentioned  projects  serve  to  directly  share  and  collect  3D  mod- 
els by  means  of  crowdsourcing.  The  data  used  for  3D  reconstruction  might  be 
commercial  or  authority  data.  In  fact,  3D  city  models  can  also  be  reconstructed 
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using  2D  vector  or  image  data  contributed  by  crowdsourcing.  The  OSM  com- 
munity has  not  only  captured  roads  and  paths,  but  also  more  and  more  POIs, 
land  use  areas  and  even  buildings.  The  latter  can  be  extracted  and  extruded 
into  3D.  At  present,  there  are  several  projects  that  generate  and  visualise  3D 
buildings  from  OSM:  OSM-3D,  OSM  Buildings,  Glosm,  OSM2World,  etc. 
The  major  limitation  of  these  projects  is  that  the  majority  of  buildings  are  only 
modelled  at  coarse  level  of  detail.  When  applying  the  concept  of  levels  level  of 
details  (LoDs)  introduced  in  CityGML  (Groger  et  al.  2008),  these  buildings 
are  actually  LoDl,  i.e.  they  are  reconstructed  by  extruding  footprints  with  flat 
roofs.  In  OSM-3D,  a number  of  buildings  are  modelled  in  LOD2  in  cases  where 
there  are  indications  for  their  roof  types.  Further,  it  is  possible  to  integrate  more 
detail,  however,  usually  manually  generated  buildings  (LOD3  or  LOD4)  from 
other  sources  via  OpenBuildingModels  (Uden  & Zipf  2012). 

Flickr  is  another  VGI  project  that  is  often  used  for  the  reconstruction  of  3D 
buildings.  Preliminary  experiments  on  reconstructing  3D  scenes  from  Flickr 
imagery  have  been  made  available  by  Snavely  et  al.  (2006;  2008)  and  Agarwal 
et  al.  (2011).  However,  Flickr  imagery  is  almost  untapped  and  unexploited  by 
computer  vision  researchers,  in  particular  when  it  comes  to  deriving  repre- 
sentations suitable  for  GIS.  A major  reason  is  that  the  imagery  is  not  in  a form 
that  is  amenable  to  processing  (Snavely  et  al.  2008).  The  photos  are  unstruc- 
tured - they  are  taken  in  no  particular  order,  and  have  uncontrolled  distri- 
bution of  the  camera  viewpoints  und  unclear  positional  accuracy.  In  addi- 
tion, they  are  uncalibrated  (Argarwal  et  al.  2011),  and  with  widely  variable 
illumination,  resolution,  and  image  quality.  All  this  increases  the  difficulty  in 
image  registration  and  sparse  3D  reconstruction.  Furthermore,  the  existing 
approaches  are  developed  based  on  dense  image  matching  which  leads  to  high 
computation  costs. 


Generating  3D  city  models  from  OSM  data 

A city  normally  consists  of  street  network,  land  uses,  buildings,  point  features 
and  others.  These  would  be  handled  separately  for  3D  modelling  and  visualisa- 
tion. It  should  be  pointed  out  that  a digital  terrain  model  (DTM)  is  required 
from  other  sources  (i.e.  open  data)  for  the  3D  visualisation,  because  OSM  does 
not  contain  any  information  about  terrain. 


Integrating  OSM  land  uses  within  3D  Terrain  Surface 

In  fact,  it  is  hard  to  integrate  OSM  data  within  3D  terrain  surface  because 
OSM  data  is  recorded  in  2D.  The  problem  can  be  solved  by  overlapping  OSM 
data  as  a liquid  net  over  the  terrain  surface  as  a solid  object  and  preserving 
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the  characteristics  of  OSM  features  (i.e.  a football  ground  should  be  flat)  dur- 
ing the  process.  In  principle,  there  are  three  alternative  ways  to  display  OSM 
land-use  data  in  3D.  This  data  can  be  displayed  by  mapping  raster  images  onto 
a digital  elevation  model  (DEM),  by  overlaying  vector  data  on  the  DEM  or  by 
combining  the  vector  data  and  the  DEM  in  an  integrated  triangulated  irregular 
network  (TIN).  Schilling  et  al.  (2007)  proposed  an  approach  to  integrate  the 
road  surface  into  the  triangulation  of  the  DTM,  which  is  represented  by  a set 
of  Triangulated  Irregular  Networks  (TINs).  This  means  that  the  road  surface 
becomes  a part  of  the  TIN.  The  street  network  is  treated  as  a layer  consisting  of 
a collection  of  polygons  representing  all  the  individual  network  segments.  The 
borders  of  the  polygons  are  integrated  into  the  TIN  as  fully  topological  edges 
so  that  we  can  distinguish  between  triangles  that  are  part  of  the  street  surface 
and  the  remaining  triangles. 

The  resulting  triangles  within  the  polygon  receive  the  attributes  of  the  source 
features  and  can  be  coloured  for  visualisation.  Another  advantage  of  this 
approach  is  that  all  layers  can  be  styled  by  the  user  on  demand  via  a 3D  styled 
layer  descriptor  (3D-SLD),  which  is  an  enhancement  of  the  OGC  SLD  standard 
(Neubauer  & Zipf  2007). 

After  integrating  the  street  network  with  the  surface  layer,  the  street  surfaces 
within  the  DTM  have  to  be  smoothed  and  corrected,  because  linear  features 
like  ditches,  smaller  dikes,  walls,  the  rims  of  terraces  and  especially  the  hard 
border  edges  of  roads  can  be  only  represented  insufficiently  due  to  the  low 
resolution  of  the  DTM.  An  example  can  be  seen  in  Figure  1.  Sometimes  the 
road  sidelines  seem  to  be  frayed.  At  steep  hillsides,  the  road  surface  is  inclined 
sideways.  The  situation  is  of  course  even  worse  with  lower-resolution  DTM 
data  sources. 

Another  common  way  to  support  linear  features  is  to  include  break  lines  dur- 
ing the  terrain  triangulation,  e.g.  using  the  Constrained  Delaunay  Triangulation 


Figure  1:  Comparison  between  the  original  terrain  surface  (left)  and  with  the 
flattened  road  segment  (right). 
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(CDT).  However,  break  lines  are  seldom  available.  However,  one  can  correct 
the  parts  of  surfaces  representing  areas  that  should  be  actually  more  or  less  flat. 
A comparison  between  the  situation  before  the  correction  and  afterwards  is 
shown  in  Figure  1.  It  is  much  more  likely  that  the  middle  line  takes  a smooth 
course  between  the  river  and  the  hillside  with  approximately  the  same  height, 
and  that  the  profile  is  nearly  horizontal,  as  can  be  seen  on  the  right  side,  instead 
of  being  very  bumpy  and  uneven. 


3D  building  objects 

In  OSM,  building  footprints  are  modelled  as  closed  polygons.  For  creating  3D 
models,  the  height  must  be  derived  from  other  OSM  attributes  (called  tags). 
The  key  height,  as  well  as  the  key  building:height,  ought  to  contain  informa- 
tion about  the  height  of  a building.  If  such  information  is  not  available,  as 
an  alternative,  the  keys  levels,  buildingdevels  and  building:levels:aboveground 
can  be  utilised  for  an  approximation  of  the  building  height  (by  multiply- 
ing the  number  of  levels  with  an  average  level  height  of  3.5  meters).  The  key 
building:min_levels  also  needs  to  be  considered  because  it  describes  the  indi- 
vidual elevation  of  a building,  thus  the  space  between  the  ground  and  the 
building  (part). 

When  computing  building  geometries,  it  is  also  interesting  to  generate 
proper  roof  geometries.  The  keys  building:roof:shape,  building:roof:style 
and  building:roof:type  contain  a semantic  description  of  the  roof  shape,  such 
as  gabled  roof  or  hipped  roof.  In  contrast,  the  key  building:roof  is  supposed 
to  contain  information  about  the  material  of  the  roof,  although  it  often  also 
contains  roof  shape  information.  Similar  to  this  key,  building:roof:material 
can  contain  information  about  the  roof  material.  Besides  those  keys,  there 
are  also  some  other  relevant  keys  for  roof  generation.  Building:roof:extent 
describes  the  extent  of  the  roof,  thus  the  actual  distance  between  the  roof 
edge  and  the  building  facade.  For  describing  the  orientation  of  the  roof,  the 
key  building:roof:orientation  is  applied:  if  the  roof  ridge  is  parallel  to  the 
longer  roof  side,  the  value  is  along;  otherwise  the  value  is  across.  The  gen- 
eration of  roof  geometries  for  simple  building  footprints,  that  is  footprints 
with  rectangular  shape  or  those  that  only  consist  of  four  points,  is  straight- 
forward and  can  be  applied  with  adequate  performance  to  the  OSM  dataset. 
For  more  complex  roofs,  such  as  those  with  holes  or  arbitrary  shapes,  the 
generation  of  roof  geometries  is  quite  challenging.  Some  early  results  have 
already  been  gained  by  using  procedural  extrusion  with  Skeleton  computa- 
tions, but  until  now  a broad  application  of  those  algorithms  for  the  whole 
OSM  on  the  one  hand  is  very  time  consuming  (about  factor  100)  and  on 
the  other  hand  does  not,  due  to  special  cases  and  exceptions,  lead  to  satisfy- 
ing results.  A detailed  description  of  the  building  generation  process  can  be 
found  in  Gotz  and  Zipf  (2011). 
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The  generation  of  3D  buildings  from  OSM  data  is  achieved  by  using  the  algo- 
rithm create3dBuildingModel , as  below: 

Algorithm  create3dBuildingModel(G,  A) 

Input:  G = 2D  Geometry  (Polygon)  from  OSM 
Input:  A = Attributes  as  OSM  key- value  pairs 
1:  3dm []  <--  empty 

2:  height  <--  getHeight(  A [height],  A[building:height],  A [levels], 

A [buildingdevels] , A [building:levels:aboveground] ) 

3:  roofShape  <—  getRoofShape(A[building:roof:shape], 

A [building:roof:style] , A [building:roof:type] ) 

4:  roofAttr  <--  getRoofType( A [building:roof: extent], 

A [building:roof:orientation] , A [building:roof:angle] , 

A [building:roof:height] ) 

5:  roofColor  <—  getRoofType(A[building:roof:colour], 

A [building:roof:color] ) 

6:  color  <--  getRoofType(A[building:colour],  A[building:color], 
A[building:fa9ade:colour],  A [building:  fa^adexolor]) 

7:  body  <--  computeBuildingBody(G,  height,  color) 

8:  roof  <--  computeRoof(G,  roofShape,  roofAttr,  roofColor) 

9:  building  <--  combine(body,  roof) 

10:  triangulate  (building) 

Output:  3D  Building 


3D  indoor  modelling 

The  3D  indoor  environment  of  buildings  could  be  generated  by  using  the 
indoor  information  mapped  on  OSM  using  IndoorOSM.  It  is  an  OSM-based 
indoor  extension  proposed  by  Gotz  and  Zipf  (2011).  The  schema  follows  exist- 
ing OSM  methodologies;  thus,  it  only  uses  nodes,  ways,  relations  and  key- value 
pairs.  That  is,  existing  OSM  editors,  such  as  JOSM7  or  Potlatch,  8,  are  suitable 
for  mapping  IndoorOSM  data.  The  schema  is  defined  as  follows:  a whole  build- 
ing is  represented  as  one  OSM  relation,  whereas  the  different  relation  members 
(the  children  of  the  relation)  are  the  different  building  levels  (floors).  A level 
itself  consists  of  one  or  several  closed  way(s)  for  representing  the  shell  of  the 
level,  that  is  the  outer  boundary,  and  several  other  closed  ways  representing  the 
inner  building  parts  (e.g.  rooms,  corridors,  etc.). 

3D  information  such  as  the  height  of  a level  or  the  height  of  a building  part 
is  attached  as  a key- value  pair  to  the  corresponding  OSM  feature  with  the  key 
height  and  corresponding  values  (e.g.  3, 6 ft,  etc.,  the  default  unit  is  meter).  That 
is,  for  each  level  and  its  inner  parts,  a two-dimensional  (2D)  footprint  geom- 
etry plus  additional  3D  information  is  available.  Further  semantic  information, 
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such  as  room  names,  level  names,  level  numbers  and  so  on  are  attached  as  key- 
value  pairs  to  the  corresponding  OSM  feature. 

In  IndoorOSM,  information  about  windows  is  provided  by  adding  nodes  to 
the  OSM  features  that  represent  the  level  shells.  Thereby,  the  location  of  the 
node  represents  the  2D  centre  of  the  window  (from  a birds  perspective).  Infor- 
mation about  the  breadth,  width  and  height  is  attached  via  corresponding  keys. 

Point  features 

In  OSM,  point  features  have  been  captured  for  an  abundance  of  different  loca- 
tions, shops,  restaurants,  facilities,  technical  installations  and  so  on.  They 
provide  in  part  very  deep  information,  which  enables  applications  that  go 
far  beyond  the  static  display  of  map  content.  For  some  categories,  a tagging 
schema  has  been  established  for  storing  typically  useful  information  about 
a specific  type  of  facility.  The  schema  for  restaurants,  for  instance,  includes 
name,  address,  opening  hours,  cuisine,  telephone  number  and  URL  of  the 
homepage. 

The  primary  OSM  key  for  this  kind  of  node  is  amenity’.  The  value  describes 
the  type,  which  can  be  used  to  assign  an  icon  or  symbol.  The  generic  name’  key 
may  be  used  for  an  additional  label.  The  amenity  types  have  been  divided  into 
the  categories:  accommodation,  eating,  education,  enjoyment,  health,  money, 
post,  public  facilities,  public  transport,  shop  and  traffic.  Each  category  is  pro- 
vided as  an  individual  layer  through  the  Web3D  Service. 

For  the  3D  environment,  these  point  features  should  be  classified  into  two 
classes:  points  as  additional  attributes  of  buildings,  and  points  as  location 
indication  of  city  facilities.  The  first  class  of  points  can  be  integrated  to  their 
corresponding  buildings  by  using  text-matching  algorithms.  The  second  class 
(Figure  2a.)  stands  for  the  objects  of  city  facilities,  such  as  bus  station,  traf- 
fic signal,  post  box,  tree  and  streetlights.  These  objects  of  city  facilities  can  be 
modelled  in  3D  by  using  generic  3D  objects  (Figure  2b.),  because  they  have 
unified  shapes  and  sizes  in  the  city  (Groger  et  al.  2008). 


Generic  objects  3D  public 

phone  booth 

Figure  2:  Generic  objects  in  a city  and  an  example  of  3D  representation  of  a 
public  phone  cell. 
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The  OSM-in-3D  projects 

The  most  advanced  work  in  the  context  of  creating  3D  city  models  from  VGI 
data  is  the  OSM-3D  project  developed  at  Heidelberg  University.  It  combines  the 
extrusion  of  building  footprints  into  the  third  dimension  with  a detailed  inte- 
grated terrain  model  derived  from  SRTM  height  data.  It  provides  the  3D  data  in 
a standardised  manner  through  a Web  3D  Service  (W3DS).  The  OSM-3D  W3DS 
supports  different  terrain  generalisation  levels  and  provides  tiled  3D  scenes, 
based  on  the  requested  point  of  view,  in  VRML,  X3D,  COLLADA  or  KML  for- 
mat. Tailored  client  software  called  XNavigator  has  also  been  developed,  which 
automatically  requests  the  data  from  the  W3DS  server  and  assembles  complex 
3D  landscapes  worldwide.  This  client  also  allows  the  integration  of  other  OGC 
Web  Services,  such  as  a Web  Feature  Service  (WFS),  the  OpenGIS  Location  Ser- 
vices or  the  Sensor  Observation  Service.  Thus,  for  example,  POIs  or  3D  routes 
can  be  included.  The  interoperability  with  different  data  sources  (e.g.  also  Cit- 
yGML),  web  services  and  targets  has  recently  been  examined  within  the  OGC 
3D  Portrayal  Interoperability  Experiment.  Figure  3 shows  the  user  interface  of 
XNavigator  with  3D  city  models  in  Heidelberg,  Germany.  The  wide  applicability 
of  the  W3DS  and  XNavigator  with  heterogeneous  data  could  be  demonstrated. 

3D  buildings  in  OSM-3D  are  generated  using  the  automated  process 
described  in  Section  3.2.  The  drawback  of  this  kind  of  automated  approach 
is  that  the  buildings  can  only  be  generated  with  coarse  geometries.  In  other 


Figure  3:  OSM-3D  overview  of  Heidelberg  in  XNavigator. 
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Figure  4:  OpenBuildingModel. 


words,  buildings  can  only  be  modelled  as  block  models  (LoDl)  or  models  with 
roof  structures  (LoD2).  Architectural  details  on  facades  unfortunately  cannot 
be  modelled.  Aiming  to  acquire  3D  buildings  with  detailed  geometries  (LoD3 
and  LoD4),  the  OpenBuildingModel  project  was  launched  in  2012.  It  is  a web- 
based  platform  for  uploading  and  sharing  entire  3D  building  models.  In  line 
with  this  project,  a user-friendly  web  interface  (see  Figure  4.)  has  been  devel- 
oped, which  allows:  (i)  uploading  3D  building  models  (modelled  by  internet 
users)  associated  with  a footprint  in  OSM  (Figure  4a),  and  (ii)  browsing,  view- 
ing and  downloading  existing  models  in  the  repository  (Figure  4b).  The  pro- 
cessing of  the  OSM  data  and  setting  up  of  a model  repository  in  the  first  proto- 
type comprises  several  steps.  First,  building  footprints  have  to  be  derived  from 
the  OSM  data  separately  and  overlaid  as  a vector  layer  by  importing  the  OSM 
data  into  a PostgreSQL/PostGIS  database  with  the  Osmosis  tool.  Then  build- 
ing footprints  can  be  operated  interactively  and  individually  on  the  web-client, 
whereby  GeoServer  is  deployed  for  the  data  provision.  By  selecting  a build- 
ing footprint,  one  can  upload  his/her  own  3D  building  (with  textures)  created 
offline.  At  the  same  time,  it  is  also  possible  to  add  attributive  information. 


Future  work 

Collaborative  mapping  in  3D  is  more  difficult  than  in  2D,  because  basic  knowl- 
edge and  skills  about  3D  modelling  are  essential  when  creating  3D  buildings 
manually  or  in  a semi-automated  way.  From  this  point  of  view,  one  cannot 
expect  much  contribution  of  3D  building  models  through  data-sharing  plat- 
forms such  as  3D  Warehouse  and  OpenBuildingModel.  In  order  to  have  3D 
buildings  with  detailed  facade  structures  at  regional,  country  and  even  global 
scales,  an  alternative  solution  has  to  be  provided. 

One  possible  solution  might  be  the  combination  of  the  two  VGI  projects:  OSM 
and  Flickr.  Building  footprints,  height  information  and  further  semantic  tags 
given  in  OSM  will  be  used  as  known  information  for  extracting  facade  geometries 
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from  dense  unorganised  Flickr  photos.  In  addition,  attributive  information  both 
in  OSM  and  in  Flickr  are  to  be  integrated  into  the  3D  building  structures.  Thus, 
the  resulting  city  models  will  not  only  be  appropriate  for  visualisation  tasks  but 
also  usable  for  further  analysis,  e.g.  urban  planning,  emergency  management  and 
simulations  for  energy  consumption.  To  achieve  this,  novel  intelligent  modelling 
concepts  for  3D  city  models  that  can  cope  with  the  growing  needs  and  require- 
ments arising  in  the  area  of  geo -information  science  have  to  be  developed. 

However,  there  are  still  many  challenges  that  make  the  3D  reconstruction 
from  VGI  data  somehow  difficult.  First  of  all,  the  data  is  heterogeneous  in  qual- 
ity, completeness  and  accuracy.  An  automated  approach  may  achieve  good 
results  in  some  regions  but  fails  in  other  regions.  Secondly,  although  there  are 
a large  number  of  images  on  Flickr  for  landmark  buildings,  there  are  still  not 
enough  to  obtain  robust  results  for  dense  image  matching,  in  order  to  acquire 
detailed  geometries  of  3D  buildings.  Images  on  other  crowdsourcing  platforms 
(e.g.  Wikipedia,  Weibo  and  Tweeter)  may  also  be  used  for  3D  reconstruction 
purposes.  The  third  issue  is  the  quality  of  the  3D  buildings.  This  can  be  evalu- 
ated by  using  authority  data  in  some  regions  where  authority  data  can  be  made 
available.  But  the  quality  of  3D  buildings  created  in  this  way  is  difficult  to  be 
controlled  from  the  sources  due  to  the  diversity  of  the  personal  capabilities  of 
internet  contributors  in  terms  of  operating  with  geo -data. 
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Glossary 


The  Glossary  has  been  compiled  by  Linda  See  and  Cristina  Capineri  and  revised 
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Terminology 

Definition 

Ambient  geo- 
graphic informa- 
tion (AGI) 

AGI  refers  to  geographic  data  that  are  passively  contributed 
by  ctizens,  e.g.  Twitter,  which  can  then  be  used  for  another 
purpose,  e.g.  studying  human  behaviour/patterns  in  the  data. 
The  term  first  appeared  in  Stefanidis  et  al.  (2013). 

Augmented  reality 

Augmented  reality  ( AR)  refers  to  the  ability  to  annotate 
places  in  the  geoweb  with  multiple  layers  of  information: 
real-world  environment  are  supplemented  (or  augmented) 
by  computer- generated  sensory  inputs  such  as  sound,  video, 
graphics,  texts  or  GPS  data.  AR  is  indeterminate,  unstable, 
context  dependent  and  subjective  (Graham  et.  al.  2012:  465; 
Graham,  Zook  & Boulton  2013). 
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Citizen  science 

Citizen  science  is  the  engagement  of  the  general  public  in  scien- 
tific research  including  collection  of  data,  hypothesis  generation 
and  the  design  of  research  experiments  (Bonney  et  al.  2009a). 

The  Christmas  Bird  Count  is  a prime  example  in  which  ama- 
teur ornithologists  are  enlisted  to  conduct  a mid-winter  census 
of  bird  populations.  Participants  require  a fairly  high  level 
of  skill,  and  over  the  years  a number  of  protocols  have  been 
established  to  ensure  that  the  resulting  data  have  high  quality. 
Citizen  science  is  also  about  education,  whereby  volunteers 
learn  new  skills  and  gain  a better  understanding  of  scientific 
research  (SOCIENTIZE,  2013). 

Coded  space 

Coded  spaces  are  parts  of  the  environment  in  which  code  (for- 
malised rules  for  information  into  other  outputs  and  represen- 
tations) or  software  contributes  to  the  production  or  enaction 
of  that  space.  In  Zook  and  Poorthius  (2014)  coded  space  emer- 
ges when  software  and  space  become  mutually  constituted  in 
this  process  of  transduction:  space  produces  code  and  code  on 
its  turn  produces  space  (Dodge  and  Kitchin  2005). 

Collaborative 

mapping 

Collaborative  mapping  is  the  collective  creation  and  annota- 
tion of  web-based  maps  and  UGC  online  (MacGillavry,  2003). 

Collaboratively 
contributed  geo- 
spatial information 

First  appearing  in  Bishr  and  Kuhn  (2007),  this  term  refers  to 
user  generated  geospatial  information  and  is  a precursor  to 
Volunteered  Geographic  Information  (VGI). 

Collective 

Intelligence 

According  to  the  collective  intelligence  handbook  of  MIT  cen- 
ter, the  idea  of  the  collective  intelligence  is  not  new,  although 
during  the  last  decade  its  meaning  has  changed.  The  term 
combines  by  collective  the  individual  actors  such  as  people, 
computational  agents,  organizations  and  by  intelligent,  the 
collective  behavior  of  the  group  exhibits  characteristics.  The 
most  representative  and  simple  definition  given  in  the  hand- 
book underlines  that  coolective  intelligence  is  about  groups  of 
people  and  computers,  connected  by  the  Internet,  collectively 
doing  intelligent  things. 

Collective  Sensing 

The  concept  has  been  explained  by  as  analysing  aggregated 
anonymised  data  coming  from  collective  networks  such  as 
Twitter,  Flickr,  Foursquare.  It  differs  from  the  idea  of  People 
as  Sensors  (people  contributing  subjective  observations)  and 
Citizen  Science  (exploiting  and  elevating  expertise  of  citizens 
and  their  personal,  local  experiences)  (Resch  2012). 
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Collective  Sensing  is  an  infrastructure-based  approach,  which 
tries  to  leverage  existing  ICT  networks  to  generate  contextual 
information.  Unlike  smartphone-based  or  specialised  web 
apps,  which  examine  single  input  data  sets,  Collective  Sensing 
holistically  analyses  events  and  processes  in  a network.  For 
instance,  increased  traffic  in  the  mobile  phone  network  might 
be  an  indicator  for  the  presence  of  a dense  crowd  of  people 
(Reades  et  al.  2007).  This  information  is  generated  without 
having  to  use  a single  persons  data  and  their  personal  details. 

Content 

Content  is  used  to  describe  a particular  form  of  information 
that  can  be  presented  to  an  audience.  It  is  synonymous  with 
some  form  of  creative  work.  Content  is  understood  to  be 
found  in  texts,  movies,  music,  books. 

Contributed  Geo- 
graphic Informa- 
tion (CGI) 

CGI  was  proposed  by  Harvey  (2013)  as  geographic  informa- 
tion that  is  contributed  by  users  via  an  opt-out’  agreement  in 
contrast  to  VGI,  which  is  collected  via  an  opt-in  agreement. 
Opt-out  agreements  are  more  open-ended  and  have  fewer 
possibilities  for  controlling  the  data  collection.  This  has  impli- 
cations for  quality,  bias  and  assessing  the  fitness-for-use  of  the 
data. 

Credibility 

Flanagin  & Metgzer  (2008)  notices  credibility’s  importance  as  a 
combination  of  both  trust  and  expertise  and  explains  that  tru- 
stworthiness and  expertise  have  both  objective  and  subjective 
components.  With  regard  to  VGI,  it  is  useful  to  consider 
credibility  according  to  the  degree  to  which  people’s  spatial  or 
geographic  information  is  unique  and  situated,  and  the  extent 
to  which  its  acquisition  requires  specialized,  formal  training. 

If  VGI  refers  to  perception  based  collections,  generally  carried 
out  by  “locals”  or  “insiders”  credibility  rests  on  the  extent  to 
which  a representative  sample  of  people  provide  their  personal 
input  honestly  and  accurately.  Collective  intelligence  can  fun- 
ction well  in  many  circumstances,  yet  it  is  also  subject  to  biases 
through  processes  of  bandwagon  effects  and  groupthink. 

Crowdsourcing 

In  2006,  the  term  crowdsourcing  was  introduced  by  Howe, 
which  is  literally  the  combination  of  the  words  crowd’  and 
‘outsourcing’.  Thus  crowdsourcing  refers  to  the  outsourcing 
of  micro-tasks  to  large  volumes  of  people  in  order  to  get  a 
task  done  that  would  not  have  been  possible  through  more 
traditional  means. 

Crowdsourcing 
Geographic  Infor- 
mation (CrGI) 

This  term  is  similar  to  VGI  and  appeared  in  a paper  by 
Goodchild  and  Glennon  (2010)  in  the  context  of  disaster 
management. 
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Digital  footprint 

Digital  footprint  refers  to  the  traces  of  data  that  are  left  by 
users  on  digital  services.  There  are  two  main  classifications  for 
digital  footprints:  passive  and  active.  A passive  digital  footprint 
is  created  when  data  is  collected  about  an  action  without  any 
client  activation  (e.g.  a click,  the  user  IP  address),  whereas 
active  digital  footprints  are  created  when  personal  data  is 
released  deliberately  by  a user  for  the  purpose  of  sharing  infor- 
mation about  oneself  (e.g.  a user  being  logged  into  a site  when 
making  a post  or  edit). 

Digiplace 

Zook  and  Graham  (2007)  define  digiplace  as  a heuristic  for 
the  subjective  mixing  of  the  code,  data  and  material  places.  It 
encompasses  the  situatedness  of  individuals  balanced  between 
the  visible  and  the  invisible,  the  fixed  and  the  fluid,  the  space 
of  places  and  the  space  of  flows  and  the  blurring  of  the  lines 
between  material  place  and  the  digital  representations  of  place. 

Digital  divide 

The  idea  of  “digital  divide”  has  been  introduced  long  time 
ago  but  only  recently  applied  to  geographic  information  by 
Goodchild  (2009)  and  M.Graham  (2014)  who  discuss  the 
issue  that  remains  largely;  the  preservation  of  those  fortu- 
nate to  have  access  to  the  Internet— and  broadband  access 
in  particular  and  names  it  a digital  divide.  They  explain 
that  while  a growing  fraction  of  citizens  in  developed  coun- 
tries have  such  access,  it  is  largely  unavailable  to  the  majo- 
rity of  the  world’s  population  who  live  in  developing  countries 
either  as  users  or  as  contributors. 

Extreme  citizen 

science 

Extreme  citizen  science  can  be  attributed  to  Muki  Haklay  and 
his  team  at  UCL  (Excites).  Extreme  citizen  science  refers  to  the 
highest  level  of  citizen  participation,  which  includes  problem 
definition,  data  collection  and  analysis. 

Geocoding 

ArcGIS  Resource  Center  gives  the  definition  of  geocoding  as 
the  process  of  transforming  a description  of  a location — such 
as  a pair  of  coordinates,  an  address,  or  a name  of  a place — to 
a location  on  the  earth’s  surface.  Davis  et  al.  (2003)  states  geo- 
coding as  the  process  of  locating  points  on  the  surface  of  the 
Earth  from  alphanumeric  addressing  data  associated  to  events. 

Geocollaboration 

Geocollaboration  was  first  defined  by  MacEachren  and 

Brewer  (2004)  in  which  technologies  enable  collaboration  that 
involves  geospatial  information  in  a visually- enabled  manner. 
Geocollaboration  can  be  used  for  knowledge  construction 
and  refinement;  design;  decision-support;  and  training  and 
education.  The  multidisciplinary  nature  of  geocollaboration 
has  been  highlighed  by  Tomaszewski  (2010),  which  brings 
together  human- computer  interaction,  computer  sicnece  and 
psychology. 
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Geographic  citizen 
science 

This  refers  to  citizen  science  where  the  location  information  is 
a key  part  of  the  citizen  science  activites. 

Geographic  Infor- 
mation Science 

Goodchild  (1992)  has  first  introduced  the  term  “Geographic 
Information  Science”  for  issues  raised  around  the  techno- 
logy that  collects  and  manages  spatial  data  in  a technological 
manner.  GIS  is  designed  for  storing,  managing,  analysing  and 
presenting  spatial  information.  As  reported  by  Longley  et  al. 
(2005)  knowing  where  something  happens  turns  a situation 
from  a curiosity- driven  to  a practical  problem-solving  science. 

Georeferencing 

Hackeloeer  et  al  (2014)  gives  the  technical  definition  of 
georeferencig.  According  to  them,  to  georeference  means  to 
associate  something  with  locations  in  physical  space.  The  term 
is  commonly  used  in  the  geographic  information  systems  field 
to  describe  the  process  of  associating  a physical  map  or  raster 
image  of  a map  with  spatial  locations.  Georeferencing  may  be 
applied  to  any  kind  of  object  or  structure  that  can  be  related 
to  a geographical  location,  such  as  points  of  interest,  roads, 
places,  bridges,  or  buildings. 

Geotag 

Goodchild  (2007)  supports  that  a geotag  is  a standardized 
code  that  can  be  inserted  into  information  in  order  to  note  its 
appropriate  geographic  location.  Geotags  have  been  inserted 
into  many  Wikipedia  entries,  when  the  contents  relate  to  a 
specific  location  on  the  Earth’s  surface,  and  several  sites  allow 
such  entries  to  be  accessed  from  maps. 

Antoniou  et  al.  (2010)  gives  a technical  definition  of  associa- 
ting images  with  the  co-ordinates  of  their  capture  locations. 

GeoWeb  or 
GeoSpatialWeb 
aka  Geographic 
World  Wide  Web 

The  GeoWeb  is  the  merging  of  spatial  information  with  attri- 
bute (i.e.  non-spatial  information)  on  the  web.  The  organiza- 
tion of  information  is  by  location,  allowing  spatial  searching  of 
the  internet.  The  GeoWeb  2.0  is  a system  of  systems  (i.e.  GIS 
clients  and  servers,  service  providers,  GIS  portals,  standards 
and  collaboration  agreements)  (McGuire,  2005). 

Hybrid  geography 

Hybrid  geography  refers  to  practice  new  synthesis  in  geo- 
graphic research  by  challenging  existing  boundaries  and  adop- 
ting creative  connections  within  geographies  (physical  and 
human,  critical  and  analytical,  qualitative  and  quantitative),  it 
suggests  the  integration  of  perspectives  on  space,  place,  flow, 
and  connection  (Sui  & DeLyser  2012;  Whatmore  2002). 

Involuntary 

geographic 

information 

(iVGI) 

This  refers  to  VGI  that  has  not  been  voluntarily  provided  by  the 
individual,  where  the  data  are  then  used  in  different  types  of 
applications,  e.g.  behavioural  studies.  The  term  first  appeared  in 
Fischer  (2012)  and  is  similar  to  AGI. 

Map  Hacking 

Map  hacking  is  the  creation  of  clever  solutions  with  digital 
maps,  e.g.  using  Gooogle  Maps. 
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Map  Mashup 

A map  mashup  is  the  aggregation  of  spatial  data  from  different 
sources  to  create  an  online  map  (Wong,  2008;  Sui,  2009). 

Motivation 

Motivation  is  a fundamental  aspect  in  crowdsourcing  activities 
and  for  volunteers  manipulating  geographic  information. 

Clary  & Miller  (1986),  Clary  & Orenstein  (1991),  Penner  & 
Finkelstein  (1998)  are  among  the  very  first  studies  which 
figured  out  the  motivations  of  volunteerism  in  general.  One 
important  factor  is  the  self  - promotion  that  an  individual 
hopes  to  gain  from  the  participation  in  the  project  (Coleman 
et  al.  2009).  Basiouka  and  Potsiou  (2013),  developed  further 
research  and  carried  out  a Survey  over  citizens’  motivations  in 
crowdsourced  governmental  projects. 

Neogeography 

Neogeography  literally  means  new  geography’  and  consists 
of  techniques,  tools  and  practices  of  geography  that  have  been 
traditionally  beyond  the  scope  of  professional  geographers  and 
GIS  practitioners  and  are  now  being  used  by  citizens  (Turner, 
2007).  Goodchild  and  Turner  debated  the  relationship  between 
VGI  and  neogeography  in  Wilson  and  Graham  (2013b).  In 
terms  of  data  practices,  Goodchild  argues  the  two  are  identical 
but  as  a new  paradigm  for  interaction  between  people  and 
geography,  it  provides  a much  broader  perspective.  Turner 
argues  that  neogeography  is  not  limited  to  the  information  but 
that  the  intelligence  and  complexity  provided  by  citizens  plays 
an  important  role. 

Ontologies 

Ontology  is  a branch  of  metaphysics  and  it  is  defined  as  the 
science  of  being  but  during  the  last  decade  GIscience  has 
embraced  ontologies  as  a new  research  direction  towards  a 
better  understanding,  representation  and  communication  of 
geographic  information.  As  regards  user  generated  content 
ontologies  and  semantic  representation  of  geographic  informa- 
tion are  relevant  in  order  to  be  properly  understood  and  used 
even  by  non  expert  (Kavouras  and  Kokla,  2011). 

OpenStreetMap 

The  most  representative  example  of  online  mapping  which  is 
based  on  volunteers’  participation  is  OpenStreetMap  (OSM); 
it  is  a global  online,  up-to-date,  dynamic  map  having  derived 
from  civil  society.  OSM  was  launched  by  Steve  Coast  in  2004 
in  United  Kingdom  and  was  spread  in  many  countries  in 
relatively  short  time.  As  the  welcome  message  says  in  the  main 
home  page  of  OSM:  “The  OSM  is  a free,  editable  map  of  the 
whole  world.  It  is  made  by  people  like  you”. 
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Participatory 

sensing 

Participatory  Sensing  is  the  contribution  of  georeferenced 
data  via  end  user  devices  such  as  smartphones  (Zacharias  et 
al.  2012).  Similar  to  the  idea  of  people  as  sensors  (Goodchild 
2007),  the  definition  is  more  restricted,  focussing  on  input 
devices,  data  acquisition  and  information  processing. 

Prosumers 

Prosumer  is  a portmanteau  formed  by  contracting  either  the 
word  professional  or,  less  often,  producer  with  the  word  consu- 
mer. For  example,  a prosumer  grade  digital  camera  is  a “cross” 
between  consumer  grade  and  professional  grade.  In  the  1980 
book,  The  Third  Wave,  futurologist  Alvin  Tofiler  coined  the 
term  “prosumer”  when  he  predicted  that  the  role  of  producers 
and  consumers  would  begin  to  blur  and  merge  (even  though 
he  described  it  in  his  book  Future  Shock  from  1970).  Tofiler 
envisioned  a highly  saturated  marketplace  as  mass  production 
of  standardized  products  began  to  satisfy  basic  consumer 
demands.  To  continue  growing  profit,  businesses  would  initiate 
a process  of  mass  customization,  that  is  the  mass  production  of 
highly  customized  products. 

Public 

Participation 

The  international  association  for  public  participation  under- 
lines the  meaning  of  citizens’  participation  that  affected  by 
or  interested  in  a decision.  Public  Participation  is  a general 
term  that  can  be  applied  in  a variety  of  projects  with  the  active 
involvement  of  citizens  and  may  involve  public  meetings, 
surveys,  open  houses,  workshops,  polling,  citizens  advisory 
committees  and  other  forms  of  direct  engagement  with  the 
public. 

Public  participa- 
tion geographic 
information 
system 

Sieber  (2006)  was  the  first  who  coined  that  PPGIS  pertains  to 
the  use  of  geographic  information  systems  (GIS)  to  broaden 
public  involvement  in  policymaking  as  well  as  to  the  value  of 
GIS  to  promote  the  goals  of  non  governmental  organizations, 
grassroots  groups,  and  community-based  organizations.  The 
idea  behind  PPGIS  is  empowerment  and  inclusion  of  margi- 
nalized populations,  who  have  little  voice  in  the  public  arena, 
through  geographic  technology  education  and  participation. 
PPGIS  uses  and  produces  digital  maps,  satellite  imagery,  sketch 
maps,  and  many  other  spatial  and  visual  tools,  to  change  geo- 
graphic involvement  and  awareness  on  a local  level. 

Public  participa- 
tion in  scientific 
research  (PPSR) 

This  term  is  synonomous  with  citizen  science  and  denotes 
public  involvement  in  all  aspects  of  scientific  research  (Bonney 
et  al.,  2009b). 
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Quality  criteria 

Quality  is  defined  as  “totality  of  characteristics  of  a product 
that  bear  on  its  ability  to  satisfy  stated  and  implied  needs” 

(ISO  2002,  originally  in  ISO  standard  8402).  The  quality  of 
spatial  data  is  related  to  the  purpose  of  creation,  to  its  use 
and  to  its  lineage.  Quality  aspects  that  mainly  affect  VGI  data 
usability  are  several,  among  which  accuracy  and  precision  of 
geo-location  and  of  observations,  completeness  and  intelligibi- 
lity of  contents,  as  well  as  the  reliability  of  information  and  the 
trustworthiness  of  the  data  source.  The  information  content 
validity,  known  also  as  intrinsic  quality  (Bordogna,  et  al.  2014), 
depends  on  a combination  of  factors  such  as 
lineage, 

positional  accuracy, 
attribute  accuracy, 
logical  consistency, 
completeness  (Moellering,  1987). 

These  factors  make  data  fit  for  a given  use.  It  is  therefore 
dependent  on  contents’  inherent  characteristics. 

According  to  Ciepluch  et  al.,  2011  literature  on  OSM  contri- 
bution apply  as  quality  criteria:  density  of  users  contributions, 
spatial  density  of  points  and  polygons,  types  of  tags  and  meta- 
data used,  dominant  contributors  in  an  area.  Various  studies 
have  been  carried  out  to  evaluate  VGI’s  accuracy.  Haklay  et  al. 
(2010)  did  a further  research  proving  that  a sensible  number  of 
volunteers  per  area  can  eliminate  the  errors  in  mapping. 

Quality  techniques 

Criscuolo  et  al.  2013  divided  the  quality  techniques  in  two  main 
categories.  Methods  applicable  to  audit  these  quality  features 
could  include  ex- ante  combined  in  different  ways  guided  such 
as  filling  of  protocols,  the  use  of  web-forms  with  fixed  fields, 
the  use  of  metadata  automatically  created  by  the  measuring 
device  (i.e.  GPS  information  associated  to  a picture  taken  with 
a smartphone),  the  use  of  ontologies  and  geographical  gazet- 
teers, of  volunteer  contributors;  and  ex-post  techniques  (i.e. 
geo -statistical  filters  automatically,  but  also  by  human  experts, 
in  order  to  detect  biases  and  maintain  the  data  consistency). 

Relevance 

According  to  Cosijn  and  Ingwersen  (1999),  relevance  has 
become  a major  area  of  research  in  the  field  of  Information 
Retrieval,  despite  the  fact  that  the  concept  relevance  is  not 
well  understood.  Saracevic  (1996)  previously  underlined  that 
nobody  has  to  explain  to  users  of  IR  systems  what  relevance 
is,  even  if  they  struggle  (sometimes  in  vain)  to  find  relevant 
contents.  People  understand  relevance  intuitively.  However, 
various  studies  on  relevance  view  IR  as  a cognitive  interaction 
between  human  and  computer.  Furthermore,  there  are  many 
types  of  relevance:  system  or  algorithmic;  topical,  pertinence; 
situational;  motivational. 
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Science  2.0 

This  term  refers  to  the  next  generation  of  collaborative  science 
enabled  through  IT,  the  internet  and  mobile  devices,  which 
is  needed  to  solve  complex,  global  interdisciplinary  problems 
(Shneiderman,  2008). 

Self-knowledge 

The  term  was  given  by  Pedresci  (2015)  in  a different  way  than 
the  usual  philosophical.  He  supported  engagement  of  people 
in  the  creation  and  use  of  big  data  and  knowledge,  by  empowe- 
ring individuals  with  self-knowledge  and  tools  so  that  they 
may  integrate  their  own  digital  breadcrumbs  into  meaningful 
knowledge  and  boost  self-awareness  of  own  behavioral  pat- 
terns: health,  mobility,  consumptions  etc. 

Spatial  Citizenship 

Spatial  Citizenship  as  a term  combines  the  spatial  awareness 
in  a societal  framework  by  using  geo-media  and  focuses  on 
the  information  production  and  communication.  Elwood  & 
Mitchell  (2013)  by  making  a further  step,  relate  the  neogeo- 
graphy to  the  political  formation  which  leads  to  the  spatial 
citizenship.  The  term  gains  particular  importance  through  the 
emergence  of  the  Geoweb. 

Spatially 

Explicit  Web  2.0 
Applications 

Spatially  Explicit  Web  2.0  Applications  refer  to  Web  appli- 
cations that  are  designed  and  built  according  to  the  Web  2.0 
principles  (O’  Reilly,  2005)  but  also  drive  their  contributors  to 
use  geography  and  location  as  a motivational  and  organisa- 
tional factor  and  explicitly  urge  them  to  interact  with  spatial 
entities.  According  to  Antoniou  et.  al  (2010)  spatially  explicit 
applications  urge  their  contributors  to  interact  directly  with 
spatial  features  (i.e.  to  capture  spatial  entities  in  their  photos) 
while  at  the  same  time  encourage  that  photos,  and  thus  the 
content,  be  spatially  distributed. 

Spatially 

Implicit  Web  2.0 
Applications. 

Refer  to  Web  applications  that  are  designed  and  built  accor- 
ding to  the  Web  2.0  principles  (O’  Reilly,  2005)  but  with  no 
explicit  reference  to  space.  Space  is  just  one  of  the  many  intere- 
sting features  that  the  applications  have  but  spatial  information 
is  neither  one  of  the  core  features  nor  is  it  the  main  motivation 
of  their  users.  According  to  Antoniou  (2011),  these  are  more 
socially  oriented,  as  they  are  aiming  to  allow  people  to  share 
their  photo  albums  with  no  explicit  reference  to  space,  and 
thus  are  regarded  as  spatially  implicit  Web  applications. 

Swarm  Intelligence 

This  terms  appears  in  Buecheler  and  Sieg  (2011)  as  a ‘buz- 
zword’ for  paradigms  like  citizen  science,  crowdsourcing  and 
open  innovation,  among  others. 

Trust  of  VGI 

Trust  of  VGI  may  be  applied  similarly  to  credibility  that  first 
explored  by  Flanagin  & Metgzer  (2008)  and  it  is  the  general 
term  where  quality  criteria  and  techniques  fall  under. 
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User-generated  content  is  also  associated  to  the  users  trust  in 
the  content  (a  subjective  concept)  which  leads  to  a connection 
between  content  quality  and  provider’s  authority  Goodchild  and 
Li  (2010)  describe  three  approaches  to  quality  assurance;  the 
crowd- sourcing,  social,  and  geographic  approaches  respectively: 
Crowdsourcing  approach:  How  a group  of  people  may  con- 
verge on  the  solution  to  a problem  that  an  individual,  even  an 
expert,  may  be  unable  to  solve.  It  also  refers  to  the  ability  of  the 
crowd  to  converge  on  the  truth  (Linus’  law). 

The  social  approach:  It  relies  on  a hierarchy  of  trusted  indivi- 
duals who  act  as  moderators  or  gate-keepers. 

The  geographical  approach:  It  relies  on  a comparison  of  a 
purported  geographic  fact  with  the  broad  body  of  geographic 
knowledge. 

Ubiquitous 

cartography 

This  is  the  study  of  how  maps  are  created  and  used  at  any  time 
or  location  in  way  that  is  much  more  frequent  in  space  and 
time  than  traditional  cartography  (Gartner  et  al.,  2007). 

User  generated 
content 

User  generated  content  is  the  general  term  which  is  applied  in 
various  fields  and  includes  publishing  of  content  by  users  in  a 
digital  format  such  as  data,  videos,  blogs,  maps,  photographs 
and  comments  or  ratings  to  other  UGC,  among  others,  which 
as  been  faciliated  by  Web  2.0  and  mobile  technology  (Moens 
et  al,  2014). 

User  Generated 
Spatial  Content 
(UGSC). 

User  Generated  Spatial  Content  is  an  alternative  term  to  VGI 
as  the  term  “Volunteered”  can  be  misleading  regarding  the 
particularities  of  the  generated  data  and  the  intentions  of  the 
data  providers.  In  a sense  “Volunteered”  implies  a noble  and 
altruistic  gesture  as  if  the  users  donate  the  data,  personal  or 
not,  to  the  world  for  any  use,  known  or  not,  to  the  data  provi- 
der. This  might  not  be  the  case  in  any  dataset  termed  as  VGI. 
Acknowledging  the  issues  raised,  a more  general  but  still  pre- 
cisely descriptive  term  is  used:  User  Generated  Spatial  Content 
(UGSC)  (Antoniou  et  al.,  2009,  Brando  and  Bucher,  2010). 

Vernacular 

geography 

According  to  Hollenstein  & Purves  (2010),  vernacular  geo- 
graphy encloses  the  sense  of  place  that  is  reflected  in  ordinary 
peoples  language  very  different  from  that  captured  by  standard 
geographical  techniques.  In  other  words,  vernacular  geography 
concentrates  on  how  people  name  and  delimit  space  in  everyday 
use.  Waters  & Evans  (2003),  identify  the  origins  of  the  phenome- 
non, as  people  are  not  geographers  but  they  exist  in  a geographi- 
cal world.  So,  much  of  human  beings’  behaviour  is  dictated  by 
how  they  perceive  areas.  Ordnance  Survey  carried  out  a project 
to  study  the  phenomenon  which  leads  people  to  talk  about  their 
world  and  say  where  things  are  within  it  in  an  unofficial  way. 
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Volunteer 

computing 

Volunteer  computing,  according  to  Anderson  and  Fedak, 

(2006)  is  a concept  where  citizens  spend  their  spare  time  to 
solve  scientific  problems  via  the  Internet.  Volunteer  computing 
is  also  referred  to  as  peer-to-peer  or  global  computing  and  is 
applied  in  various  academis  fields  such  as  high-energy  physics, 
molecular  biology,  medicine,  astrophysics,  climate  study,  and 
other  areas. 

Sarmenta  and  Satochi  (1999)  refer  to  the  idea  of  volunteer 
computing,  where  people  form  very  large  parallel  computing 
networks  by  using  ubiquitous  and  easy-to-use  technologies 
such  as  web  browsers  and  Java. 

Volunteer  thinking 

According  to  Munyaradzi  (2013),  volunteer  thinking  is  the 
act  of  using  ones  brain  and  cognitive  skills  to  solve  a problem. 
Volunteer  / Distributed  thinking  (Quinn  and  Bederson,  2011) 
is  the  harnessing  of  human  brain  power  on  the  Internet  to 
solve  problems  that  machines  are  not  suitable  to  tackle.  In 
volunteer  thinking,  users  are  tasked  to  solve  some  fundamental 
problem,  reduced  to  a simplistic  level  that  is  easy  to  com- 
prehend. Using  their  mental  and  cognitive  abilities,  volunteers 
actively  attempt  to  solve  the  problem  at  hand.  The  types  of 
problems  vary  in  nature  e.g.  image  tagging  & classification, 
proof-reading  documents  and  pattern  recognition.  The  tasks 
are  designed  in  such  a manner  that  volunteers  need  no  pre- 
vious experience  to  solve  the  problem.  Anderson  (2006)  also 
developed  the  Bossa  crowd-  sourcing  framework  that  manages 
distributed  Web-based  volunteer  thinking  projects. 

Volunteered 
Geographic 
Information  (VGI) 

The  term  VGI  was  first  coined  by  Goodchild  (2007)  as  the 
voluntary  collection  and  dissemination  of  spatial  infromation 
by  individuals  who  often  have  little  training  or  formal  qualifi- 
cations in  spatial  sciences. 

Web  mapping 

Web  mapping  is  cartographic  representation  on  the  web,  with 
a particular  focus  on  UGC,  user-centred  design  and  ubiquitous 
access  (Tsou,  2011). 

Wikinomics 

Wikinomics  embodies  the  idea  of  mass  collaboration  in  a busi- 
ness environment.  It  is  based  on  four  principles:  a)  openness; 
b)  peering;  c)  sharing;  and  d)  acting  globally.  The  book  itself  is 
meant  to  be  a collaborative  and  living  document  that  everyone 
can  contribute  to  (Tapscott  and  Williams,  2006). 
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This  book  focuses  on  the  study  of  the  remarkable  new  source  of 
geographic  information  that  has  become  available  in  the  form  of  user- 
generated content  accessible  over  the  Internet  through  mobile  and 
Web  applications.  The  exploitation,  integration  and  application  of  these 
sources,  termed  volunteered  geographic  information  (VGI)  or  crowd- 
sourced  geographic  information  (CGI),  offer  scientists  an  unprecedented 
opportunity  to  conduct  research  on  a variety  of  topics  at  multiple 
scales  and  for  diversified  objectives. 

The  Handbook  is  organized  in  five  parts,  addressing  the  fundamental 
questions:  What  motivates  citizens  to  provide  such  information  in  the 
public  domain,  and  what  factors  govern/predict  its  validity? 

What  methods  might  be  used  to  validate  such  information? 

Can  VGI  be  framed  within  the  larger  domain  of  sensor  networks, 
in  which  inert  and  static  sensors  are  replaced  or  combined  by  intelligent 
and  mobile  humans  equipped  with  sensing  devices?  What  limitations 
are  imposed  on  VGI  by  differential  access  to  broadband  Internet, 
mobile  phones,  and  other  communication  technologies,  and  by 
concerns  over  privacy?  How  do  VGI  and  crowdsourcing  enable 
innovation  applications  to  benefit  human  society? 

Chapters  examine  how  crowdsourcing  techniques  and  methods,  and 
the  VGI  phenomenon,  have  motivated  a multidisciplinary  research 
community  to  identify  both  fields  of  applications  and  quality  criteria 
depending  on  the  use  of  VGI.  Besides  harvesting  tools  and  storage  of 
these  data,  research  has  paid  remarkable  attention  to  these 
information  resources,  in  an  age  when  information  and  participation  is 
one  of  the  most  important  drivers  of  development. 

The  collection  opens  questions  and  points  to  new  research  directions  in 
addition  to  the  findings  that  each  of  the  authors  demonstrates.  Despite 
rapid  progress  in  VGI  research,  this  Handbook  also  shows  that  there  are 
technical,  social,  political  and  methodological  challenges  that  require 
further  studies  and  research. 
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